## GERHARD WOHLGENANNT

**Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources**

#### FORSCHUNGSERGEBNISSE DER WIRTSCHAFTSUNIVERSITÄT WIEN

The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. Corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology in combination with Semantic Web sources such as DBpedia and OpenCyc allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach.

Gerhard Wohlgenannt is a senior researcher at the New Media Technology Department, MODUL University Vienna. He received his PhD from the Institute for Information Business at Vienna University of Economics and Business (WU). His research interests include ontology learning, text mining and the Semantic Web.

## **Forschungsergebnisse der Wirtschaftsuniversität Wien**

Band 44

#### **Bibliographic Information published by the Deutsche Nationalbibliothek**

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available on the Internet at http://dnb.d-nb.de.

Open Access: The online version of this publication is published on www.peterlang.com and www.econstor.eu under the international Creative Commons License CC-BY 4.0. Learn more on how you can use and share this work: http://creativecommons.org/licenses/by/4.0.

This book is available Open Access thanks to the kind support of ZBW – Leibniz-Informationszentrum Wirtschaft.

> Cover design: Atelier Platen according to a design of Werner Weißhappl.

University logo of the Vienna University of Economics and Business Administration. Printed with kind permission of the University.

Sponsored by the Vienna University of Economics and Business Administration.

ISBN 978-3-631-75384-2 (eBook) ISSN 1613-3056 ISBN 978-3-631-60651-3

© Peter Lang GmbH Internationaler Verlag der Wissenschaften Frankfurt am Main 2011 All rights reserved.

All parts of this publication are protected by copyright. Any utilisation outside the strict limits of the copyright law, without the permission of the publisher, is forbidden and liable to prosecution. This applies in particular to reproductions, translations, microfilming, and storage and processing in electronic retrieval systems.

#### www.peterlang.de

## **Acknowledgements**

First of all, I would like to thank my supervisors, Prof. Wolfgang Panny and Prof. Arno Scharl, for providing the organizational support and a very stimulating work environment at the Institute for Information Business of the Vienna University of Economics and Business, in close collaboration with the Department of New Media Technology of MODUL University Vienna.

Over the last few years, I had the opportunity to do extensive research work with the members of several project teams. Many ideas that found their way into this thesis and a number of related publications grew out of these activities. With regard to the publications, I also wish to acknowledge the anonymous reviewers' feedback and valuable comments, which helped to improve the manuscripts.

It has been a pleasure to work with my current and former colleagues at the involved institutes: Heinz Lang, Johannes Liegl, Wei Liu, Roman Kern, Hans Mitlohner, Thomas Neidhart, Walter Rafelsberger, Arno Scharl, Hermann Stern, Kamran Ali Ahmad Syed, Albert Weichselbraun, and Dimitri Zibold. Particular thanks go to Arno Scharl and Albert Weichselbraun for suggesting numerous improvements in terms of content, style and structure of this thesis. Heinz Lang provided the Java source code for creating the Jena inference model, and a wrapper to access the Scarlet API.

Financial support was provided by the Austrian Federal Ministry for Transport, Innovation and Technology via the FIT-IT Semantic Systems projects AVALON<sup>1</sup>, IDIOM<sup>2</sup> and RAVEN<sup>3</sup>.

I am grateful to my friends Cathrine Konopatsch, Robert Koehl and Isabell Handler for proof-reading parts of the thesis. Finally, I would like to thank my family for their long-term support.

<sup>1</sup>http://www.kmi.tugraz.at/research/projects/avalon

<sup>2</sup>http://www.idiom.at

<sup>3</sup>http://www.modul.ac.at/nmt/raven

## **Abstract**

Ontologies are formal and shared conceptualizations of domains of interest, and are a crucial ingredient of the Semantic Web and of knowledge-based applications. Because the manual construction of ontologies is a cumbersome and expensive undertaking, a lot of research effort has been invested in developing methods to (semi-)automatically learn ontologies. In contrast to existing approaches to ontology learning, which are typically either applied to natural language text or to structured information sources, this doctoral thesis proposes a novel approach that combines corpus-based methods with knowledge extracted from Semantic Web sources for learning non-taxonomic relations in ontologies.

The corpus-based methods use vector space model similarities of verbs co-occurring with unlabeled and labeled relations to calculate relation label suggestions from an arbitrary but specified set of label candidates. The integration of additional semantics gained from reasoning on data from external sources such as DBpedia and OpenCyc links domain concepts to concepts from a meta ontology. This information from semantic inference and validation then helps to refine label suggestions generated by the corpus-based methods on the basis of ontological restrictions defined upon the meta ontology.

A formal evaluation presents the accuracy and average ranking precision of the proposed hybrid approach. It demonstrates the superior performance as compared to methods that solely rely on domain text data or those that only build upon reasoning on external structured data sources.

# **Chapter 1 Introduction**

Ontologies have emerged as an important area of research in the field of computer science [156] over the last decade. The number of international conferences and workshops devoted to the topic reflects this observation. Every knowledge-based system or knowledge-level agent is committed to some implicit or explicit conceptualization - an ontology is an explicit specification of a shared conceptualization [73]. There are a number of reasons for the development and application of ontologies, for example: to create and share common understanding of a specific domain in a group of people, to make domain assumptions explicit and actionable, to separate domain knowledge from operational knowledge, and to enable reuse of domain knowledge [126, 156].

The Semantic Web is an extension of the current World Wide Web, originally proposed by Berners-Lee [15]. It "provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries".<sup>1</sup> The Semantic Web depends strongly on the timely proliferation of ontologies [108] and requires a global consensus on the appropriate semantic structures (domain ontologies) for representing any possible domain of knowledge [156].

Fast and easy engineering of ontologies is an important ingredient for the Semantic Web, as well as for many other applications that utilize ontologies. Although a lot of time and effort has been invested into methodologies for ontology engineering [184, 59, 126, 129, 64], the creation of a conceptualization for non-trivial domains remains a difficult and time-consuming task [37, 128]. A major challenge in ontology engineering is to develop domain models with significant domain coverage, but nevertheless meaningful and consistent generalizations. Furthermore, the evolution of domains results in a constant need for refinement of domain ontologies to ensure their usefulness.

Ontology engineering requires highly specialized manual effort [50], which is also the primary bottleneck and cost-driver. Automated approaches that learn ontologies from existing data would be the ideal solution to the problem. Many researchers have attempted to learn ontologies from natural language text, as there is an abundant supply of this source of input data. Although the correctness and consistency of automatically generated ontologies cannot be guaranteed, which makes human postprocessing necessary [37], automated approaches improve the productivity of ontology engineers and reduce the human input required.

#### **Problem Statement**

The labeling of non-taxonomic relations between concepts is one of the main tasks in ontology learning [108], and it is considered a particularly challenging undertaking [95]. In order to establish the niche for the present work, the thesis provides an overview of the state of the art in this research field. The overview is limited to some selected examples for the sake of brevity; for a more detailed introduction to related literature see Section 4.3.

Many approaches in relation detection focus on specific types of relations, such as causal relations [67], the identification of meronyms [14, 66], telic and agentive relations [197], or the learning of qualia structures [41]. The mentioned work mostly relies on lexico-syntactic patterns in the tradition of Hearst [82]; other methods apply machine learning techniques, for example Zelenko et al. [202] to extract relations like person-affiliation, or Poesio et al. [132] for the acquisition of feature norms. In contrast to methods that extract specific relations, domain-independent approaches related to the open information extraction paradigm [52] collect relations with unknown identifiers with a focus on scalability, based for example on huge text corpora [10], table structures on the Web [26], or the Deep Web [27].

Methods that acquire arbitrary relations for a typically limited or even predefined set of relation types tackle a very similar problem as do the methods presented in this doctoral thesis. SemEval 2007, an NLP workshop, included a task on the classification of semantic relations between nominals, where many participants combined techniques from machine learning and natural language processing [124, 11, 69, 125]. Rote extractors allow the automatic learning of extraction patterns for arbitrary relations upon training data [21, 7, 148]. Other text-based methods include work on extracting highly significant verbs as relation labels [95] from domain text with probabilistic measures, approaches that leverage parsing techniques [35, 144, 142], or the application of Web statistics and Web corpora in learning non-taxonomic relations [156].

More recently some authors applied Semantic Web datasets and ontologies for relation detection, for example in a method for ontology construction by cutting and pasting ontology modules [4]. Other approaches to discover relations anchor the respective concepts in background ontologies [6] or in ontologies found on the Semantic Web on-the-fly [152]. Those techniques currently suffer from low recall due to a lack of appropriate domain ontologies available. Lehmann et al. [102] find connections between different DBpedia resources in the corresponding graph, but the selection of an explicit label from the paths determined is non-trivial.

#### **Goals and Contributions**

There are comparatively few publications on combining corpus-based methods and techniques that integrate knowledge from online ontologies for the detection of non-taxonomic relations. This doctoral thesis aims at closing this gap by introducing a novel approach to detect labels for previously unlabeled non-taxonomic relations. To this end, it combines corpus-based methods and knowledge derived from online semantic resources. The corpus-based methods extract verbs co-occurring with labeled as well as unlabeled relations from domain text, and generate labeling suggestions for unlabeled relations based on similarity values yielded by vector space models which include the most significant verbs. Knowledge from structured sources then refines these label suggestions. A typically small meta ontology defines a set of relation types (predicates) regarding their domain, range and property restrictions. Ontology reasoning with data from external sources grounds domain concepts occurring in unlabeled relations in the meta ontology. This allows the refinement of relation label suggestions from corpus-based methods by verifying the conformance to the ontological restrictions. The relation labeling component is an extension addressing shortcomings of an existing ontology learning framework [105] (see Section 4.4), but the approach is generally applicable.
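The two stages described above can be illustrated with a minimal Python sketch. It is not the implementation developed in this thesis: the verb counts, the label set, the meta ontology classes and the helper names (`cosine`, `suggest`, `refine`) are all hypothetical, and hard filtering on domain and range is only one conceivable way to use the ontological restrictions.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse verb-count vectors."""
    dot = sum(a[v] * b[v] for v in a.keys() & b.keys())
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical verb counts aggregated over relations that already carry a label.
labeled = {
    "causes": Counter({"cause": 5, "trigger": 3, "lead": 2}),
    "locatedIn": Counter({"lie": 4, "locate": 4, "border": 1}),
}

# Hypothetical verbs co-occurring with the unlabeled relation (emissions, warming).
unlabeled = Counter({"cause": 2, "lead": 1, "locate": 1})


def suggest(unlabeled_vec, labeled_vecs):
    """Corpus-based stage: rank label candidates by vector space similarity."""
    scores = {label: cosine(unlabeled_vec, vec) for label, vec in labeled_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical meta ontology: label -> (required domain class, required range class).
restrictions = {
    "causes": ("Event", "Event"),
    "locatedIn": ("PhysicalObject", "Location"),
}

# Hypothetical grounding of domain concepts in meta ontology classes; in the
# thesis this information comes from reasoning on external Semantic Web data.
grounding = {"emissions": {"Event"}, "warming": {"Event"}}


def refine(ranking, subject, obj):
    """Reasoning stage: keep only labels whose domain and range fit the grounding."""
    kept = []
    for label, score in ranking:
        domain, range_ = restrictions[label]
        if domain in grounding.get(subject, set()) and range_ in grounding.get(obj, set()):
            kept.append((label, score))
    return kept


ranking = suggest(unlabeled, labeled)
refined = refine(ranking, "emissions", "warming")
```

In this toy setting the corpus-based stage already ranks `causes` first, and the refinement step additionally discards `locatedIn`, because neither concept is grounded in the classes that label requires.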

The main contributions of this thesis are: (i) the presentation of a novel method which integrates techniques from ontology learning from text with reasoning on Semantic Web data, (ii) a formal description of the processes and algorithms involved, (iii) the creation of a modular and extensible framework that implements the proposed methods, as well as the documentation of major aspects of the implementation, (iv) the introduction of a method to semantically enrich arbitrary terms with mapping and reasoning techniques applied to linked data from DBpedia and online ontologies, (v) the provision of extensive formal experiments to assess the performance of the described methods, which also evaluate the accuracy of a number of variants and configuration settings.

#### **Remainder of this Thesis**

Chapter 2 gives an overview of the broader context of the present work. It motivates the Semantic Web, characterizes its features and concludes with a section on Semantic Web applications.

Chapter 3 formally introduces ontologies and elaborates the main research areas related to ontologies. Furthermore, it describes representation languages for ontologies: a discussion of the W3C's specifications of the languages RDF, RDF Schema and OWL provides the basics necessary to understand the datasets and ontologies used in Semantic Web applications, as well as the approaches presented in Chapter 4. Query languages and ontological reasoning help to leverage the full power of semantic applications, and tools such as the Redland libraries or the Jena RDF toolkit provide the mechanisms necessary for handling RDF graphs. Finally, the chapter discusses the data sources and ontologies utilized in the thesis.

Chapter 4 gives an introduction into the research field of ontology learning, and then covers techniques and literature related to the novel methods presented in this thesis - those methods and the implementation thereof are a very significant constituent of Chapter 4. The first section describes the main ontology learning tasks along a set of layers, followed by a presentation of fundamental techniques from heterogeneous fields such as natural language processing, statistics or machine learning commonly applied in ontology learning. Furthermore, the chapter supplies an extensive survey of the state of the art with a focus on work in the area of learning non-taxonomic relations. The survey groups existing work by the type of input data, such as domain text corpora, the Web, or Semantic Web data sources, and by the methods applied in the learning process. The later part of the chapter outlines the novel methods developed for this thesis. The description contains the details about the two main elements of the method for labeling non-taxonomic relations, i.e. a set of algorithms that apply vector space models, and components to refine the results by reasoning on knowledge generated from information in external structured sources. The final section of the chapter depicts the architecture which implements the proposed methods.

Chapter 5 addresses the crucial issue of evaluating the methods described in Chapter 4. An extensive set of experiments evaluates the performance of the overall method to label non-taxonomic relations, as well as the most important components, especially the corpus-based methods (the vector space models) and concept grounding with the help of online semantic data.

Finally, Chapter 6 summarizes the presented work, emphasizes the main contributions, draws conclusions and comments on open issues and possible lines of future research.

# **Chapter 2 The Semantic Web**

This chapter embeds the present thesis in its broader context of related research. It introduces the Semantic Web, which is an extension of the current World Wide Web, and is intended as a global-scale collection of machine-readable statements. The chapter introduces the original visions regarding the Semantic Web and complements the visions with some considerations about its current status. It also discusses the characteristics and features of Semantic Web applications, and lists some commercial and academic projects.

## **2.1 Overview**

The Semantic Web promises to solve some of the problems that exist regarding the current Web. Section 2.1.1 gives an overview of the basic ideas, design goals and the current status of the endeavor. Section 2.1.2 introduces features that distinguish the Semantic Web from the current Web as well as from traditional knowledge-based systems, for example how intelligent behavior emerges on the Semantic Web. Popular misconceptions about the Semantic Web presented in Section 2.1.3 help to clarify the concepts involved.

#### **2.1.1 Background and Vision**

While being the result of an unprecedented success story, the current Web is often inconsistent, disconnected, and out of sync. It feels like it is "a mile wide, but only an inch deep" [8, p 10]. An update of a bit of information in one place leaves the other places untouched, causing inconsistency. That is one of the reasons why many modern websites rely on relational database systems to generate website content on the fly. Database normalization techniques provide consistency locally, but databases usually do not integrate with third-party websites. Content on Web pages is made for human consumption; it is stored as HTML, or in application-dependent formats such as files created with office programs. Distributed systems can hardly process and integrate content with other systems automatically, which leads to disconnectedness, and sometimes to frustrating effects for the user. Information, for example address data, has no explicit representation that can be processed by a machine, and therefore the consumer has to manually transform and transfer the content to use it in another service.

The Semantic Web, in contrast, is designed to provide a layer that makes smarter applications perform to their potential. Data in the Semantic Web is intended to be modeled and described in a way that makes it possible to integrate it on a global level - a "Web of data" [85] or a "Web of actionable information" [168, p 96]. As an example, if two different companies produce computer parts and expose their product data on the Web, a third company or service should be able to automatically understand and use that data [58]. "The main idea of the Semantic Web is to support a distributed Web at the level of data rather than at the level of presentation. Instead of having one webpage point to another, data items point to one another using global references called Uniform Resource Identifiers (URI)." [8] The coherent data model used by applications is no longer held by the applications themselves or by their underlying database systems, but is part of the Web infrastructure. The data items on the Semantic Web are described in a machine-readable, distributable way upon a single and distributed data model - making the Web less dumb.
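The computer parts example can be made concrete with a small Python sketch. It models statements as plain (subject, predicate, object) tuples rather than using an RDF library; all URIs, the shared vocabulary prefix `EX` and the helper names are hypothetical and serve illustration only.

```python
# Each statement is a (subject, predicate, object) triple; URIs are plain strings.
EX = "http://example.org/vocab/"  # hypothetical vocabulary shared by both companies

# Product data exposed by a first (hypothetical) company.
company_a = {
    ("http://a.example/parts/cpu-1", EX + "type", EX + "Processor"),
    ("http://a.example/parts/cpu-1", EX + "socket", "S1"),
}

# Product data exposed by a second (hypothetical) company.
company_b = {
    ("http://b.example/items/board-7", EX + "type", EX + "Mainboard"),
    ("http://b.example/items/board-7", EX + "socket", "S1"),
    ("http://b.example/items/board-9", EX + "type", EX + "Mainboard"),
    ("http://b.example/items/board-9", EX + "socket", "S2"),
}

# Because both sources use global identifiers and the same data model,
# a third party can integrate them with a simple set union.
web_of_data = company_a | company_b


def objects(graph, subject, predicate):
    """All values stated for `predicate` about `subject`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def same_value_subjects(graph, subject, predicate):
    """Other subjects sharing at least one value of `predicate` with `subject`."""
    values = objects(graph, subject, predicate)
    return {s for s, p, o in graph if p == predicate and o in values and s != subject}


matching_boards = same_value_subjects(
    web_of_data, "http://a.example/parts/cpu-1", EX + "socket"
)
```

The third-party query works without any agreement between the two companies beyond the shared vocabulary, which is the kind of data-level integration the paragraph above describes.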

Van Harmelen [185], referring to Marshall and Shipman in [115], presents a more diversified view on the Semantic Web - he distinguishes two types of goals: (i) In the first interpretation the Semantic Web aims towards the integration of structured and semi-structured data sources over the Web in order to federate and re-use those data sets. (ii) The second interpretation focuses on the enhancement of the current Web content with additional semantic metadata - where techniques such as concept extraction, named-entity recognition and automatic classification extract the metadata automatically. These conflicting assumptions also lead to some of the fallacies and criticisms about the Semantic Web presented below in the paragraph *Misconceptions and Criticism.* But a central aspect, which both interpretations agree on, is that the Semantic Web is a global-scale collection of formal, ontology-based and machine-readable statements about Web resources and other entities.

The World Wide Web Consortium's (W3C) Semantic Web working groups<sup>1</sup> are the major force supplying the Semantic Web's vision as well as its design principles, formal specifications and enabling technologies. Those specifications include work on RDF, RDFS, OWL, SPARQL and others - these technologies will be covered in Sections 3.2 to 3.4.

The Semantic Web is also known under the names Deep Web, Smart Web or sometimes as Web 3.0, although Tim Berners-Lee described it as a part of Web 3.0: "I think maybe when you've got an overlay of scalable vector graphics - everything rippling and folding and looking misty - on Web 2.0 and access to a Semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource." [169]

Furthermore, the Semantic Web is a way to tackle a traditional problem in Artificial Intelligence (AI) research: the so-called knowledge acquisition bottleneck [55]. The knowledge acquisition bottleneck is concerned with the difficulty of the acquisition, representation and maintenance of an intelligent system's knowledge base [47]. Some people see it from a rather epistemological view, as the difficulties in formalizing knowledge to make it processable for machines, but pragmatically it is more an economic problem: the cost of acquiring and maintaining a knowledge base must be less than the economic benefits derived from the system. The knowledge acquisition bottleneck is a crucial problem in AI research, because after a phase focusing on general methods for problem solving and efficient theorem proving in the mid 1970s [47], many in the community realized that the fundamental problem of understanding intelligence is how to represent large amounts of knowledge in a way that permits their effective use [70]. Over the last 20 years, researchers have developed robust and cost-effective knowledge-engineering processes, including technologies for specifying reusable model components (ontologies) and reasoning components - which have a strong influence on current Semantic Web components. The Semantic Web has a strong connection to AI research, but its key advocates argue that the Semantic Web is not AI. AI is concerned with engineering intelligent machines, while the Semantic Web is a technological infrastructure to enable large scale data interoperability [47]. Compared to classical knowledge-based systems with their closed domains, it could open a way to new intelligent applications exploiting the large-scale and distributed knowledge supplied by the infrastructure.

Although much effort has been invested in tools and technologies, especially the formalisms, standards and languages provided by the W3C, with the effect that those technologies are quite mature nowadays, d'Aquin et al. [47] state that the Semantic Web, from an applicational point of view, is still in an "embryonic state". The reason is that most of the existing applications only consume their own data, rather than the Semantic Web as a large scale information source. Motta and Sabou [123] present a number of criteria, which applications ought to satisfy to move away from this "First Generation" of Semantic Web applications to a new generation, most of which will be discussed in the next paragraph. According to Lee and Goodwin [101], the Semantic Web is mirroring the growth of the Web in the early nineties - an indicator that a large-scale adoption will become reality sooner or later. The Semantic Web search engine Swoogle<sup>2</sup>, for example, found 705,406,123 triples of semantic data as of April 2010. Similar numbers from other Semantic Web search engines also indicate that there already exists a useful knowledge source for intelligent applications. Several projects work on bringing more data online in order to increase its usefulness and applicability, most prominently the global Linking Open Data community project.<sup>3</sup> Collectively, its data sets consist of more than 13.1 billion RDF triples (April 2010).

#### **2.1.2 Features**

Viewing the Semantic Web as a very large Knowledge-Based System (KBS), d'Aquin et al. [47] present several key differences between classical KBS and the Semantic Web in the areas of (i) heterogeneity, (ii) quality, (iii) scale and (iv) reasoning. (i) KBS are built around small sets of carefully designed and integrated ontologies, whereas the Semantic Web makes the non-trivial effort of integrating very heterogeneous ontologies necessary, heterogeneous in terms of ontology encoding, quality, complexity, modeling and views. (ii) In KBS a small team of knowledge engineers builds ontologies in a centralized fashion. On the Semantic Web information stems from different sources and strongly varies in quality - so trust is a key issue. (iii) The Semantic Web with its millions of documents and billions of triples calls for a totally new way to locate and process data. (iv) Instead of sophisticated reasoning mechanisms used on generic tasks as applied by traditional KBS, Semantic Web applications rather draw their intelligence from scale, i.e. the sheer amount of data available.

*Network effects* create a virtuous cycle [8] of content creation: The more people participate and put information or data online, the more attractive it is for new people to join. Metcalfe's law [170] describes this observation more formally. Another feature is the so-called *Open World Assumption,* which implies that at any point new information can come to light, and that no conclusion may be drawn relying on the fact that the information available at a point is all information existing. *Non-unique Naming,* the final feature, describes the fact that some Web resources may be referred to using different names by different people - so distinct URIs need not refer to distinct resources.

<sup>2</sup>http://swoogle.umbc.edu, Statistics from 2009-06-27

<sup>3</sup>http://linkeddata.org
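The Open World Assumption and non-unique naming can be contrasted with closed-world behavior in a few lines of Python. This is a simplified sketch under stated assumptions: the `ex:` identifiers and function names are hypothetical, `None` stands for "unknown", and the alias computation only mimics the effect of identity statements such as `owl:sameAs`; an actual OWL reasoner does considerably more.

```python
# Known statements as (subject, predicate, object) triples; identifiers are hypothetical.
known = {("ex:Vienna", "ex:capitalOf", "ex:Austria")}


def closed_world(graph, triple):
    """Closed world: a statement that is not recorded is taken to be false."""
    return triple in graph


def open_world(graph, triple):
    """Open world: a missing statement is unknown (None), not false."""
    return True if triple in graph else None


# Identity statements: two different names that denote the same resource.
same_as = {("ex:Vienna", "ex:Wien"), ("ex:Wien", "dbp:Vienna")}


def aliases(uri, links):
    """All names reachable from `uri` via symmetric, transitive identity links."""
    group = {uri}
    changed = True
    while changed:
        changed = False
        for a, b in links:
            # Extend the group when exactly one end of a link is already in it.
            if (a in group) != (b in group):
                group |= {a, b}
                changed = True
    return group


def holds(graph, triple, links):
    """Open-world lookup that also accepts any alias of the subject."""
    s, p, o = triple
    if any((alias, p, o) in graph for alias in aliases(s, links)):
        return True
    return None


question = ("dbp:Vienna", "ex:capitalOf", "ex:Austria")
```

Here `closed_world(known, question)` rejects the statement because it is not stored literally, `open_world(known, question)` leaves it undecided, and `holds(known, question, same_as)` confirms it once the different names are recognized as referring to the same resource.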

#### **2.1.3 Misconceptions and Criticism**

Van Harmelen [185] lists four popular fallacies or misconceptions about the Semantic Web. The first one is that the Semantic Web, or its standards, enforce meaning from the top onto users with formalisms such as OWL. Van Harmelen counters that those standards are there for users to express their own meaning freely, and that they can assign their meaning to terms in vocabularies. Fallacy number two refers to the popular opinion that the Semantic Web requires everybody to conform to a single predefined meaning of terms - but in fact the motto is rather "let a thousand ontologies bloom". This is also a reason why much research effort is invested in the area of ontology mapping (see Section 3.1.3). The third fallacy is that the Semantic Web requires users to understand the details of formalized knowledge representation. Although the details of ontology languages are complicated matters, not every user needs to know them, just as a user does not have to know HTML or CSS to navigate the current Web. The last of the misconceptions is that the Semantic Web people will demand the manual markup of all existing Web pages. In fact, the Semantic Web relies on the automation of large-scale markup extraction from current Web representations, mostly with lightweight semantics. Many modern Web applications address this issue by creating annotations in machine-readable formats upon the publishing of data, for example as microformats.<sup>4</sup>

Alani et al. [5] present a few misconceptions about the Semantic Web from the viewpoint of adoption and application of its technologies in organizations, some of which overlap with the fallacies described in the paragraph above. The misconception that ontologies are typically large and complex, and that they are expensive to design, build and maintain is countered with the argument that applications don't always require heavyweight and complex ontologies of domain knowledge, but that lightweight ontologies often suffice. Lightweight ontologies (see Section 3.1) can have a wide applicability, and they are cost effective to build in terms of overall utility to the community. Some decision makers worry that existing data has to be expensively converted to Semantic Web formats, and current technologies replaced. However, simple scripts or conversion languages can often automatically accomplish the conversion - data is kept in the current format and exported when needed. Many organizations suspect that providing public access to their data only benefits the public; but as the current document Web has shown, there are economic gains for the owners of information, too. The last cause for worries covered by Alani et al. [5] is the fear that the promiscuous release of data and information will be a privacy nightmare. In fact there are standards being developed for access control, and in the meantime, as with conventional database and Web technologies, organizations can choose which data they share publicly.

Peter Gärdenfors [65] criticizes the Semantic Web effort on a different level. Relating to arguments by Shirky [171], he states that the Semantic Web with its "neat ontologies and syllogistic logic" is not that effective in the real world, where a shared world view is hard to create. He argues that the Semantic Web reduces semantic content to first-order logic or set theory, and doubts that the Web Ontology Language (OWL) can express important notions like *similarity* in a natural way. He also refers to the *symbol grounding problem* [76], which concerns how a symbolic expression can obtain any meaning that goes beyond the formal language itself - and be grounded in the external world in terms of meaning. Gärdenfors argues that John Locke already brought up this problem in the year 1690 in his *Essay Concerning Human Understanding* [106], where he described the difficulty of agreeing on the precise number of simple ideas belonging to any sort of thing, or its qualities.

## **2.2 Applications**

The later parts of this section discuss some applications that make extensive use of Semantic Web technologies. If applications also integrate the massive amounts of Semantic Web data and documents that are available on the Internet, then d'Aquin et al. [47] call them "next generation Semantic Web applications". Section 2.1.2 describes the set of features that distinguish next generation applications from classical knowledge-based systems. Because the Semantic Web combines heterogeneous sources, variable data quality and global-scale distributed data, those applications will derive their intelligent behavior rather from the capability to exploit large amounts of data than from complex inferencing - intelligence comes as a side effect of scale. Other types of reasoning, partly on non-semantic data, become crucial: reasoning based on machine learning and on linguistic and statistical techniques. In contrast to next generation Semantic Web applications the first generation typically uses just a single ontology that supports the integration of a set of data sources fixed at design time.

D'Aquin et al. [47] present the features of next generation applications: (i) The application needs to be able to dynamically find relevant information on the Web for the task at hand. (ii) The application has to select appropriate information (in terms of quality, etc.) from the documents found in (i). (iii) As the application must be able to exploit heterogeneous knowledge sources, it cannot make assumptions about the ontological nature of target information. (iv) Ontologies and resources must be combined, as it cannot be expected that one single source provides all necessary information. To be able to leverage the power of online semantics, it is crucial to have a single access point to the data. This access point collects, analyzes, and indexes Semantic Web data and provides it to the applications. As current access points such as Swoogle<sup>5</sup> and Sindice<sup>6</sup> have limitations, d'Aquin et al. developed Watson [48] as a new Semantic Web gateway to provide mechanisms for extracting semantic documents with keyword search, retrieving their metadata, and querying the content (e.g. with SPARQL). "Watson offers applications all the necessary elements to select and exploit online semantic resources". Among the applications that build on the Watson gateway are PowerMagpie, PowerAqua and Scarlet [48]. PowerMagpie helps users to interpret arbitrary Web content by extracting and summarizing important conceptual entities relevant to a page; it highlights those entities and puts them in context with dynamically retrieved ontologies. PowerAqua is a question-answering system based on an unlimited number of ontologies, which is able to combine various ontologies at runtime. Scarlet explores ontologies to automatically retrieve relations between two input concepts - Scarlet will be discussed in more detail in Section 4.3, as it is integrated into the system developed for the present thesis.

Corporations still use Semantic Web applications quite rarely. Alani et al. [5] state that "it's probably fair to say that many organizations still view the Semantic Web with some scepticism. In part, they may suspect that they're expected to pioneer an approach in which quick wins are few". Furthermore, organizations worry about cost and privacy issues when linking ever-increasing amounts of data to the Web. Some of the misconceptions have already been addressed in Section 2.1.3. Alani et al. [5] analyze the special characteristics of using Semantic Web technologies in corporations. They argue that linking data and information does indeed offer local and private gains for individuals and organizations. Some of the factors that make the deployment of Semantic Web technologies attractive are: Minimize disruption to existing infrastructure, e.g. gradually convert existing data to Semantic Web formats with simple scripts. Use small, well-focused ontologies for individual information assets to keep the effort of ontology development low. Show the added value gained by integration and shared access, for example consistency checking, and provide relative ease of integration and efficient data exchange and merging.

<sup>5</sup>http://swoogle.umbc.edu

Already in 2006 van Harmelen [185] observed a shift in the profile of companies active in the Semantic Web field, from small start-ups to big corporations. He lists the following areas where the respective technologies begin to take shape: knowledge management, mostly for intranets of big corporations; data integration (e.g. at Boeing); e-Science, esp. life sciences; convergence of the Semantic Grid.

This overview of some of the aspects of Semantic Web applications concludes with a few examples of current Semantic Web applications. Siri<sup>7</sup> is a personal assistant for the mobile phone capable of doing simple assistance jobs and answering questions such as "Where is the nearest shop?", or executing commands like "I need a cab". Siri was born out of SRI's CALO Project, the largest Artificial Intelligence project in U.S. history (according to the Siri Web site). The ambitious vision is that in the next five years almost everyone with a connected lifestyle will delegate details of day-to-day tasks to intelligent assistants, which coordinate and simplify the details of their lives. True Knowledge<sup>8</sup> provides a question-answering system to respond to questions in any domain. It has a search engine-like natural language user interface. The application aims at giving instant and precise answers to questions - as opposed to current Web search engines, which just return a long list of possibly related documents. True Knowledge relies on Semantic Web technologies to answer complex questions by drawing inferences and conclusions on that data. Wolfram|Alpha<sup>9</sup> is another question-answering system. It relies on a formal Mathematica representation at its heart. Wolfram|Alpha mostly depends on its own data and does not apply Semantic Web technologies or data extensively, and therefore is no Semantic Web application in the narrow sense. TripIt<sup>10</sup> automatically organizes all of a user's travel information into a master travel itinerary that is easy to share and access. The master itinerary aggregates a lot of travel-related information in one place. Twine<sup>11</sup> is a service to track, find and share content. Twine uses Semantic Web technology to help people organize, disseminate and discover information related to their interests. The application stores information as RDF triples and makes them accessible via the Twine APIs. TopQuadrant<sup>12</sup> supports companies in moving from disparate data into integrated, actionable and reusable knowledge, using the product TopBraid Suite, which is a set of components for semantic solutions. The SemanticMiner is one of the products created by ontoprise<sup>13</sup>. This application provides semantic search capabilities for companies. Leveraging the power of ontologies, this product supports moderated search and the optimization of search results, and also gives an integrated view on heterogeneous sources of data and information.

<sup>7</sup>http://www.siri.com

<sup>8</sup>http://www.trueknowledge.com

<sup>9</sup>http://www.wolframalpha.com

<sup>10</sup>http://www.tripit.com

<sup>11</sup>http://www.twine.com

<sup>12</sup>http://www.topquadrant.com

<sup>13</sup>http://www.ontoprise.de

# **Chapter 3 Ontologies**

This chapter introduces ontologies from a Semantic Web viewpoint. It covers fundamental aspects such as definitions, languages for ontology representation, querying and reasoning, as well as public datasets and ontologies used in the later parts of this thesis.

Section 3.1 provides definitions and fundamental characteristics of formal domain conceptualizations referred to as ontologies, which serve as a vocabulary for the Semantic Web. Furthermore, the section discusses some of the main research fields regarding the topic. The following sections then present practical aspects of the Semantic Web, i.e. existing technologies and standards which implement the original ideas and tools for developing Semantic Web applications. Among those technologies are the languages for representing ontologies, e.g. RDF, RDF Schema and OWL, presented in Section 3.2, and standards for Semantic Web graph querying and tools for reasoning, such as Jena or Redland (Section 3.3). Section 3.4 discusses public ontologies and Semantic Web datasets which were applied in the course of this thesis, for example the DBpedia and Freebase datasets, or the OpenCyc ontology.

## **3.1 Fundamentals**

This section formally defines the term ontology as well as the entities that constitute an ontology. Furthermore, it discusses the motivations to build such conceptualizations relying on the work of Noy and McGuinness [126], and distinguishes lightweight and heavyweight ontologies. The Semantic Web in general, and the area of ontologies in particular, are research fields that have gained a lot of attention over the last years. Section 3.1.3 provides an outline of the major tasks in ontology research.

#### **3.1.1 Purpose**

Most work in computer science about ontologies mentions the roots of the term *ontology* in philosophy, especially Greek philosophy. *Ontology* is the study or science of being, existence or reality. Cimiano [37] elaborates on the elements of the ancient roots that are particularly relevant for the computer science use of the concept. Plato (427-347 BC) laid the foundation for ontology by explicitly contrasting the world of forms or ideas with the physical, observed plane. His student Aristotle (384-322 BC) formed the logical background by introducing notions such as *category* and *subsumption*, and by creating hierarchies with the concepts of *genus* and *subspecies*. With the help of *differentiae* he classifies objects into categories, thereby creating subspecies of one genus. "In fact, Aristotle can be regarded as the founder of taxonomy, i.e. the science of classifying things." [37, p 9].

*Ontologies* provide the vocabulary that is used in the Semantic Web. Ontologies are models containing concepts and relations that are relevant to a particular task or application domain [23]. Gruber [73] states that every knowledge-based system or knowledge-level agent is committed to some implicit or explicit conceptualization. Such a conceptualization is an abstract and simplified view of some part of the world, and contains its objects, concepts, and other entities and relations that hold between them. An ontology is a formal specification of a shared conceptualization of a domain of interest. "In such an ontology, definitions associate the names of entities in the universe of discourse (e.g. classes, relations, functions, or other objects) with human-readable text describing what the names mean, and with formal axioms that constrain the interpretation and well-formed use of these terms. Formally, an ontology is the statement of a logical theory" [73, p 909].

Noy and McGuinness [126] summarize the motivations and reasons for the development of ontologies as follows:


#### **3.1.2 Structure and Entities**

Ontologies usually include a taxonomic backbone, i.e. a hierarchy of concepts connected by *is-a* relations. Figure 3.1 shows a very small example ontology; the concepts connected with directed links form the hierarchical structure (the taxonomy). Sub-concepts inherit the properties of parent concepts, as in the example where *Student* inherits all properties of *Person*. Next to *is-a* relations, any number of non-taxonomic relations are possible between concepts, such as the *work-for* relation between *Professor* and *University*. Another important distinction is between concepts and instances of concepts; instances are individuals associated with a concept.

Building on Maedche et al. [108], the present work uses a lightweight definition of entities that define an ontology; for a more formal definition the interested reader is referred to [37] or [179]:


*Figure 3.1:* A small example ontology

The notion of *domain and range* restrictions on relations is of particular importance, as this doctoral thesis extensively uses those definitions in later sections. For a binary relation between two terms, also referred to as a "slot", the first term must be an instance of the class that is the domain of the slot and the second must be an instance of the class that is the range of the slot. For example, one could represent the slot *mother* in a way that the domain is *Female Animal* and the range is *Animal*. Domain and range thus restrict the terms (or, regarding ontologies, the instances of classes) that constitute a binary relation to a certain class (domain) or certain values (range). For the *worksAt* relation in Figure 3.1 one might define the domain of the relation to be instances of class *Person* and the range to be instances of *Organization*.
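The effect of such restrictions can be illustrated with a minimal Python sketch. The class hierarchy, the instances and the helper names below are illustrative assumptions and are not taken from Figure 3.1 or from the system developed in this thesis; the sketch only shows how a statement can be checked against the domain and range of its relation.

```python
# Hypothetical mini-ontology; all names are illustrative assumptions.
SUBCLASS_OF = {"Student": "Person", "Professor": "Person",
               "University": "Organization"}
INSTANCE_OF = {"alice": "Professor", "bob": "Student", "wu": "University"}
RELATIONS = {"worksAt": ("Person", "Organization")}  # relation -> (domain, range)


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or is a transitive sub-concept of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def satisfies_restrictions(subject, relation, obj):
    """Check a statement against the domain and range of its relation."""
    domain, range_ = RELATIONS[relation]
    return is_a(INSTANCE_OF[subject], domain) and is_a(INSTANCE_OF[obj], range_)


print(satisfies_restrictions("alice", "worksAt", "wu"))  # True
print(satisfies_restrictions("wu", "worksAt", "alice"))  # False
```

The second call fails because *wu* is an instance of *University*, which is not a sub-concept of the domain *Person*.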

Noy and McGuinness [126] provide a simple step-by-step knowledge-engineering methodology for the construction of ontologies. Lassila and McGuinness [100] show the spectrum from very lightweight and informal ontologies to richly axiomatized heavyweight ontologies on a continuous line, see Figure 3.2. Not all ontologies share the same amount of formal explicitness [45], nor do they include all the components that can be expressed in a formal language, such as concept taxonomies and various types of formal axioms. Therefore, the ontology community usually distinguishes lightweight and heavyweight ontologies [178].

*Figure 3.2:* From lightweight to heavyweight ontologies [100]

Corcho [45] gives examples of ontologies used for document annotation within the described spectrum. Many organizations apply the Dublin Core<sup>1</sup> element set, a lightweight ontology which belongs to the category *terms/glossary* and is used to specify the characteristics of electronic documents. The popular FOAF (Friend-Of-A-Friend)<sup>2</sup> vocabulary aims at the creation of a Web of machine-readable pages describing people and the links between them, as well as the things they create and do. FOAF can be regarded as a *formal instance*. An example of a heavily axiomatized ontology is GALEN<sup>3</sup> [185], an ontology in the domain of clinical medicine. D'Aquin et al. [47] found that around 95% of online ontologies included in the Watson Semantic Web gateway are lightweight ontologies - big, dense, and large-scale ontologies are comparatively rare.

#### **3.1.3 Ontology Research Fields**

This section summarizes the most important ontology research fields in a Semantic Web context. The core research area of the present work, ontology learning, holds strong connections to the other topics, which are ontology population, ontology evolution and ontology alignment.

#### **Ontology Learning**

Knowledge engineers may build ontologies manually using guidelines [126] or use methodologies for ontology construction such as Methontology [59] and the Melting point methodology [64] for decentralized ontology development. However, the present work focuses on (semi-)automatic ontology learning. This (semi-)automatic process leverages information from various sources to generate ontologies, such as text, data from semi-structured sources [175, 163] or data from structured sources [4].

<sup>1</sup>http://www.dublincore.org

<sup>2</sup>http://www.foaf-project.org

<sup>3</sup>http://www.opengalen.org


Later sections of this thesis focus on learning from unstructured data (text) as well as on learning from Semantic Web data and ontologies available online. Section 4.1 provides extensive information on the research area of ontology learning.

#### **Ontology Population**

The aim of ontology population is to learn instances of concepts as well as of relations [37]. Hence the task is to learn the *instance-of* relation, which makes it very closely related to many tasks in the area of ontology learning. If the ontology population application keeps a link to the text where the instances were detected and if it contextualizes the assignment with the context specified by the documents or text in question, then the task is referred to as *knowledge markup* or *annotation* [37, 45]. There is a strong relation between ontology population, Named Entity Recognition (NER) and information extraction (IE). Applying natural language processing techniques, IE deals with filling predefined sets of *knowledge structures* ("templates").

An example of this is the *seminar announcements* task, where the goal is to extract the location, speaker, topic, or date of a seminar announcement from a document [37]. NER is traditionally concerned with finding instances of certain concepts (person, organization, location) in text. Current NER approaches go beyond this basic set of classes [37]. A major difference between NER and ontology population is that NER classifies each occurrence of a term in a text separately, while ontology population classifies the term itself, independent of context [182]. IE and NER are restricted to a set of templates or concepts. When dealing with the much bigger number of more fine-grained concepts and slots defined in ontologies, these methods face a serious scalability problem [37]. In addition, the creation of annotated training data becomes almost impossible as the set of concepts changes with every new ontology [182]. Therefore the ontology population task is traditionally tackled with unsupervised methods, whereas NER and IE often rely on supervised methods. Tanev and Magnini [182] distinguish two main paradigms in ontology population: using patterns [83] or relying on the structure of terms [186], such as Cimiano and Völker [38] who use contextual features. Pattern-based approaches look for phrases in text that explicitly show typed relations, such as the "is-a" relation, for example in "animals such as cats and dogs" (*term<sub>1</sub>* such as *term<sub>2</sub>* and *term<sub>3</sub>*). Those methods then extract the instance terms (for more details see Section 4.2.2). As such phrases do not occur frequently in text, some approaches use the Web as corpus [160]. Context feature methods use features from the context in which a concept appears; those features are also extracted from a corpus [182]. Syntactic features tend to lead to better results than superficial features [38].
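The pattern-based paradigm can be sketched in a few lines of Python. The regular expression below covers only the single "such as" pattern mentioned above and treats every word as a candidate term; real systems use larger pattern sets and linguistic preprocessing. The function name and the example sentences are assumptions made for illustration.

```python
import re

# One lexico-syntactic pattern: "<concept> such as <t1>, <t2> and <t3>".
# Longer separators come first so that ", and " is not read as ", ".
SEPARATOR = r", and |, or |, | and | or "
SUCH_AS = re.compile(rf"(\w+) such as (\w+(?:(?:{SEPARATOR})\w+)*)")


def extract_such_as(text):
    """Return (term, concept) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        concept = match.group(1)
        terms = re.split(SEPARATOR, match.group(2))
        pairs.extend((term, concept) for term in terms)
    return pairs


print(extract_such_as("animals such as cats and dogs"))
# [('cats', 'animals'), ('dogs', 'animals')]
```

Because the pattern is purely lexical, it will also match phrases in which the words around "such as" are not concepts and instances at all, which is one reason why such approaches are usually combined with further filtering.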

#### **Ontology Evolution**

The first wave of work in the field of ontologies focused on ontology construction, not taking into account that the encapsulated domain knowledge changes over time [74]. As Fensel [57] states, in an open and dynamic environment the domain knowledge constantly evolves. Shadbolt et al. [168] stress the importance of adapting to those changes and call ontologies "living structures". Plessers et al. [130] define ontology evolution as "the process of adaption of an ontology to arisen changes in the corresponding domain while maintaining both the consistency of the ontology itself as well as the consistency of depending artifacts". Examples of such artifacts are related ontologies, dependent Web sites or Web applications. There are multiple causes that require changes in an ontology: the application domain or user requirements may change, or design flaws may be detected. Stojanovic et al. [177] distinguish *usage-driven changes*, which are requested by users or ontology engineers, and *data-driven changes*, which reflect changes in the described domain. Weichselbraun et al. [189] take a close look at *data-driven changes* and analyze changes in a concept's importance and reasons for change in the semantics of a concept itself. Plessers et al. [130] and Haase and Stojanovic [74] agree that ontology evolution is a non-trivial problem and cannot be performed manually by an ontology engineer, but has to be supported by an ontology management system, which ensures consistency and transparency among a group of ontology engineers.

#### **Ontology Alignment**

With the development of many new ontologies in the context of the Semantic Web, their reusability becomes an increasingly important feature. However, to reuse an existing ontology together with a new one they need to be integrated [6], especially if they cover overlapping domains. This problem is known as ontology alignment (also referred to as ontology matching, ontology mapping, ontology integration, or semantic integration) and is one of the most active research areas in the Semantic Web community [185]. Shvaiko and Euzenat [172] define ontology matching as the task of determining the relations between entities in two ontologies. Noy and Musen [127] distinguish ontology alignment from ontology merging. Ontology merging tries to create a single coherent ontology from multiple input ontologies, whereas ontology alignment establishes links between ontologies and also provides for the reuse of information. Alignment can be facilitated by creating direct links between ontologies, or also by linking the two ontologies to a third ontology, which serves as mediator. Traditional techniques in ontology alignment focus on two tasks [185]: (a) Lexically matching the elements in the ontologies using string-based and linguistic methods to detect relatedness based on the labels used. (b) Exploiting the ontology's structure (i.e., the relations) in order to detect similarities. Van Harmelen [185] investigates the use of background knowledge, and exploits the structure of a third extensive background ontology to acquire additional information for the matching process. Sabou et al. [151] extend this work by automatically finding and exploring multiple and heterogeneous online knowledge sources from the Semantic Web in order to derive mappings.
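Task (a), string-based matching of labels, can be illustrated with a short Python sketch. It uses the similarity ratio of `difflib.SequenceMatcher` from the standard library as a stand-in for the string metrics applied in actual alignment systems; the threshold of 0.8, the function names and the example labels are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive string similarity between two labels, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_labels(labels_a, labels_b, threshold=0.8):
    """For each label in labels_a, propose its best match in labels_b."""
    matches = []
    for a in labels_a:
        best = max(labels_b, key=lambda b: label_similarity(a, b))
        score = label_similarity(a, best)
        if score >= threshold:  # keep only sufficiently similar labels
            matches.append((a, best, round(score, 2)))
    return matches


ontology_a = ["Professor", "University", "Student"]
ontology_b = ["professor", "Universities", "Course"]
for a, b, score in match_labels(ontology_a, ontology_b):
    print(a, "<->", b, score)
```

In this toy setting *Student* remains unmatched because no label of the second ontology is lexically close to it, which is exactly the situation where structure-based methods or background knowledge become relevant.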

## **3.2 Representation**

When referring to Semantic Web technologies, most authors discuss the Semantic Web layer cake<sup>4</sup>, which is presented in Figure 3.3. The layer cake combines standardized technologies from the lower levels of the figure with abstract notions such as *trust* and *proof*, on top of which *user interfaces and applications* are to be built.

This introduction gives only a brief overview of the Semantic Web layer cake elements; more information follows in the upcoming sections. On the lowest level of the layer cake are the Uniform Resource Identifiers (URIs)<sup>5</sup> and Internationalized Resource Identifiers (IRIs)<sup>6</sup>, which serve as identifiers for abstract or physical resources. A well-known subset of URIs are Uniform Resource Locators (URLs), as applied for the identification of Internet resources. IRIs add internationalization support to URIs by using characters from the Universal Character Set (Unicode/ISO 10646) for the identifiers. According to the specification, every URI is also an IRI. The Extensible Markup Language (XML)<sup>7</sup> is a markup language geared towards the representation of hierarchically structured data as text files. XML is a common format used for data exchange and integration on the Internet. The Resource Description Framework (RDF) offers tools to make statements about resources, and to link them. RDF models are often serialized as RDF/XML, but the models exist independently of serialization. RDF Schema is a semantic extension of RDF; it provides the functionality to describe RDF domain vocabularies. RDF Schema allows the specification of groups of related resources and also of relations between those resources. OWL is far more expressive than RDF Schema; it provides additional vocabulary and adds formal semantics. SPARQL is a query language for RDF models, which includes powerful capabilities to express queries across diverse data sources. The other layers in the layer cake are the Rule Interchange Format (RIF) for the description of rules where OWL is not sufficient, *cryptography* in order to verify that Semantic Web statements are coming from a trusted source, and the rather abstract notions of *unifying logic* and *proof*.

*Figure 3.3:* The Semantic Web layer cake

<sup>5</sup>http://www.ietf.org/rfc/rfc2396.txt

<sup>6</sup>http://www.ietf.org/rfc/rfc3987.txt

<sup>7</sup>http://www.w3.org/XML

The following Sections 3.2.1, 3.2.2 and 3.2.3 give an introduction to three of the major building blocks to generate and represent Semantic Web data and ontologies. RDF provides the basic framework needed to create statements about resources, RDF Schema adds terminology to create taxonomies and simple constraints, and OWL yields additional facilities for the description of formal relations between classes.

### **3.2.1 Resource Description Framework**

This section deals with the basic concepts and elements of RDF, the Resource Description Framework. RDF's purpose is to represent information about resources on the World Wide Web. The concept is quite general, however: it allows the description not only of resources that are retrievable on the Web, but of anything that can be identified with the help of URIs. RDF builds on a rather simple model, aiming at large-scale management and processing of statements about resources, which can be combined to provide a global-scale network of information. Therefore, RDF is designed for distribution and exchange of data between applications without a loss of meaning. Most of the information given in this section, and also some of the example code, is based on W3C's RDF Primer<sup>8</sup>.

#### **Basic Concepts**

RDF is primarily geared towards being processed by machines and is not a format to be consumed directly by humans. A basic idea in RDF is the identification of things with Web identifiers, which are called Uniform Resource Identifiers. URIs allow for the identification of physical or abstract resources, for example Web pages, people, or products. Properties and property values describe resources in terms of simple statements. The statements form a graph, which reflects the resources and their relations to other resources or to literals. An example graph is given in Figure 3.4.

*Figure 3.4:* An RDF graph describing Eric Miller, adapted from [112]

Figure 3.4 includes a few statements made about a resource (<http://www.w3.org/People/EM/contact#me>), for example that the name of the resource is "Eric Miller", and that it has the email address em@w3.org. The example shows the two main facilities used in RDF to identify things: URIs and literals. URIs represent individuals (like <http://www.w3.org/People/EM/contact#me>, which refers to the individual "Eric Miller"), kinds of things (e.g. the notion of a person, identified by <http://www.w3.org/2000/10/swap/pim/contact#Person>), and also properties (for example a *hasmailbox* property in <http://www.w3.org/2000/10/swap/pim/contact#mailbox>). A corresponding RDF/XML representation (using an abbreviated syntax) of the graph shown above is as follows [112]:

```
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person
      rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>
```
The details about RDF statements and the RDF/XML syntax needed to understand the given examples will follow in the upcoming sections. RDF statements have a very simple structure. They always include a *subject*, a *predicate* and an *object*. Therefore, RDF statements are also referred to as *RDF triples*. The subject is the thing (resource) that a statement describes, the predicate is the specific property of the subject which is described, and the object is the value of that property. So the simple English sentence "*mailto:wohlg@ai.wu.ac.at* is the *email address* of *Gerhard Wohlgenannt*" corresponds to a triple where the subject is mailto:wohlg@ai.wu.ac.at, the predicate is *email address* and the object is the literal "Gerhard Wohlgenannt".
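As a rough illustration of the triple structure, the statements of Figure 3.4 can be held in a plain Python list. This is a minimal sketch using only the standard library; the `Triple` and `Literal` types and the `objects_of` helper are hypothetical names introduced here and do not correspond to the API of any RDF toolkit.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """A literal value, as opposed to a URI reference."""
    value: str


class Triple(NamedTuple):
    subject: str                  # a URI reference
    predicate: str                # a URI reference
    object: Union[str, Literal]   # a URI reference or a literal


CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
ME = "http://www.w3.org/People/EM/contact#me"

graph = [
    Triple(ME, CONTACT + "fullName", Literal("Eric Miller")),
    Triple(ME, CONTACT + "mailbox", "mailto:em@w3.org"),
    Triple(ME, CONTACT + "personalTitle", Literal("Dr.")),
]


def objects_of(graph, subject, predicate):
    """Return all objects of statements with the given subject and predicate."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


print(objects_of(graph, ME, CONTACT + "fullName"))
```

A graph is thus nothing more than a set of such statements; the node-and-arc drawing and the RDF/XML document are two renderings of the same triples.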

As RDF is intended to be machine-processable, it needs machine-processable identifiers and a machine-processable language. URIs provide appropriate identifiers; RDF uses URI references to resources (URIrefs). URIrefs start with a *scheme* part, followed by a colon, and end in a scheme-specific part. An example of this is ftp://some.address.net/a/file.txt, where ftp denotes the scheme. Other examples are urn:issn:1111-9137 or http://www.weblyzard.com. A *fragment identifier* is the optional last element of a URI, separated from the rest with the character #. The identifier http://www.weblyzard.com/index.html#person27 combines the URL http://www.weblyzard.com/index.html with the fragment identifier person27. In RDF statements, subjects and predicates are represented by URIrefs, objects by URIrefs or by literals. To represent statements in a machine-processable and exchangeable way, RDF uses XML and defines a specific XML markup language (RDF/XML).
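The parts of a URIref described above can be inspected with Python's standard `urllib.parse` module. The following sketch merely decomposes the example identifiers from the text; it is not part of any RDF processing pipeline.

```python
from urllib.parse import urldefrag, urlsplit

examples = [
    "ftp://some.address.net/a/file.txt",
    "urn:issn:1111-9137",
    "http://www.weblyzard.com",
]
for reference in examples:
    # The scheme is the part in front of the first colon.
    print(urlsplit(reference).scheme, "<-", reference)

# Split a URIref into the URL and its optional fragment identifier.
url, fragment = urldefrag("http://www.weblyzard.com/index.html#person27")
print(url)       # http://www.weblyzard.com/index.html
print(fragment)  # person27
```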

A simple alternative to drawing graphs is to write down RDF statements in *triples.* The statements which correspond to the graph in Figure 3.4 are:

```
<http://www.w3.org/People/EM/contact#me>
   <http://www.w3.org/2000/10/swap/pim/contact#fullName>
    "Eric Miller"
<http://www.w3.org/People/EM/contact#me>
   <http://www.w3.org/2000/10/swap/pim/contact#personalTitle>
    "Dr."
[ ... ]
```
The full triple notation writes the complete URIrefs out inside angle brackets, which leads to long lines. There is a shorthand substitution, the so-called XML Qualified Names (QNames). A QName consists of a namespace prefix and a local name, separated by a colon. The namespace prefix needs to be assigned to a namespace URI beforehand. Using QNames, the statements from the listing above are expressed as [112]:

```
@prefix co1: <http://www.w3.org/People/EM/contact#> .
@prefix co2: <http://www.w3.org/2000/10/swap/pim/contact#> .
co1:me co2:fullName "Eric Miller"
co1:me co2:personalTitle "Dr."
[ ... ]
```
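The substitution of QNames by full URIrefs is a simple string operation, sketched below in Python. The prefix table mirrors the two namespaces of the Eric Miller example; the prefix names themselves are arbitrary, and `expand_qname` is a hypothetical helper written for this illustration.

```python
# Prefix table: namespace prefix -> namespace URI (prefix names are arbitrary).
PREFIXES = {
    "co1": "http://www.w3.org/People/EM/contact#",
    "co2": "http://www.w3.org/2000/10/swap/pim/contact#",
}


def expand_qname(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' into the full URIref; unknown prefixes raise KeyError."""
    prefix, local_name = qname.split(":", 1)
    return prefixes[prefix] + local_name


print(expand_qname("co1:me"))
# http://www.w3.org/People/EM/contact#me
print(expand_qname("co2:fullName"))
# http://www.w3.org/2000/10/swap/pim/contact#fullName
```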
Table 3.1 presents some well-known prefixes commonly used with RDF.


*Table 3.1:* Commonly applied prefixes and the respective namespace URis

RDF uses URIrefs to convey meaning; sets of URIrefs are called vocabularies. Such vocabularies are typically based on URIrefs within a common namespace, so terms from the vocabulary are chosen by combining the namespace prefix with a local name. Examples are the vocabularies rdf: and rdfs: given in Table 3.1, which include the terms defined by RDF itself and the terms from RDF Schema (see Section 3.2.2). The usage of common namespace prefixes for a vocabulary is just a convention. The RDF model does not assume any relation between URIrefs from a common namespace, and it is common practice to mix URIrefs from various namespaces in an RDF file.

The use of URIrefs for the identification of things has several advantages over the use of literals. Literals like "Eric Miller" are inherently ambiguous, as there exist many persons named "Eric Miller". URIrefs provide a more precise identification of a resource, and the use of URIrefs for properties yields the opportunity to give additional information and a clear semantics for that property. URIrefs do not solve all problems; for example, different URIrefs may refer to the same thing. OWL provides terminology to mark classes and individuals as equivalent. On the other hand, organizations should try to use widespread terminology such as Dublin Core<sup>9</sup> where applicable, instead of creating their own.

RDF facilitates the representation of *structured information*, e.g. an address that consists of a number of fields such as street name or postal code, in two ways. The first option is to create an *intermediate* URIref to represent the aggregated concepts - this creates a universal identifier. If there is no need for such an identifier, then so-called *blank nodes* are a better choice. Blank nodes are anonymous resources, and they only have local identifiers, which are unique for the respective graph. As RDF allows only binary relations (relations between a subject and an object), blank nodes provide a workaround to break down n-ary relations into binary ones. The subsequent listing, which breaks the n-ary relation between an individual and an address down with the help of the blank node \_:johnaddress, exemplifies the use of blank nodes [112]:
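The decomposition of an n-ary relation into binary statements can be sketched in Python, with plain tuples standing in for triples. The property names, the address values and the helper functions are illustrative assumptions; only the idea of a locally scoped blank node identifier is taken from the text.

```python
import itertools

_node_ids = itertools.count(1)


def new_blank_node():
    """Create an identifier that is only meaningful within this graph."""
    return f"_:b{next(_node_ids)}"


def address_triples(person, street, city, postal_code):
    """Break the n-ary person/address relation down into binary statements."""
    node = new_blank_node()
    return [
        (person, "exterms:address", node),        # link person to blank node
        (node, "exterms:street", street),         # one statement per field
        (node, "exterms:city", city),
        (node, "exterms:postalCode", postal_code),
    ]


# Hypothetical staff member and address, for illustration only.
for triple in address_triples("exstaff:85740", "1 Example Street",
                              "Vienna", "1020"):
    print(triple)
```

Each address field becomes one binary statement about the blank node, and a single additional statement connects the individual to that node.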


As already mentioned, RDF supports the use of literals as values of properties (as objects). Next to plain literals, such as the string "Eric Miller" or "27", RDF provides *typed literals.* Plain literals involve the problem that the application processing the respective RDF statements has no additional information on how to parse the given data: a string "27" may be handled as the characters "2" and "7", as the decimal number 27, as the octal number 27, etc. A typed literal is formed by attaching the URIref which identifies the datatype to the literal, for example:

exstaff:85740 exterms:age "27"^^xsd:integer .

xsd:integer is the abbreviated form of the full URIref <http://www.w3.org/2001/XMLSchema#integer> and marks the literal as a decimal integer. So typed literals provide a way to specify the datatype of a string. The datatypes themselves are defined externally to RDF. It is common practice to use XML Schema datatypes in this context.
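How an application might use the datatype URIref to interpret a literal is sketched below. The converter table covers only three XML Schema datatypes, and the function name `literal_value` is an assumption of this sketch, not part of any RDF library.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Only three datatypes are mapped here; a real processor supports many more.
CONVERTERS = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
    XSD + "string": str,
}


def literal_value(lexical_form, datatype=None):
    """Interpret a literal; plain literals (no datatype) stay uninterpreted strings."""
    if datatype is None:
        return lexical_form
    return CONVERTERS[datatype](lexical_form)


age = literal_value("27", XSD + "integer")   # typed literal: the integer 27
plain = literal_value("27")                  # plain literal: still the string "27"
```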

#### **RDF/XML**

RDF/XML is an XML syntax used to write down (serialize) and to exchange RDF graph models. The RDF/XML Syntax Specification<sup>10</sup> defines RDF/XML. The following example demonstrates some of the basic aspects of RDF/XML:

```
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:exterms="http://www.example.org/terms/">
  <rdf:Description rdf:about="http://www.example.org/idx.html">
    <exterms:creationDate>August 16, 1999</exterms:creationDate>
  </rdf:Description>
</rdf:RDF>
```
The example starts with <?xml version="1.0"?>, which states that the subsequent data is XML formatted and gives the XML version used. RDF documents are required to be well-formed XML, but no validation against a Document Type Definition (DTD) is intended. Every RDF file has to start with an rdf:RDF element, which is closed at the end of the file. RDF files contain namespace declarations, which may be attributes of the rdf:RDF tag. xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" defines all names that start with rdf: as part of the http://www.w3.org/1999/02/22-rdf-syntax-ns# namespace. The rest of the file contains the actual statements. The example lists only one statement, but RDF permits an arbitrary number of statements per document. The rdf:about attribute at the beginning of the statement denotes the subject. The next line provides the property element, in this example <exterms:creationDate>. Finally, the value of the property, i.e. the object, is included as a literal. So the subject element encloses the property element, which itself encloses the object. Distinct rdf:Description elements separate the various statements.

RDF includes a number of abbreviation formats to simplify RDF/XML. It is common practice to combine multiple statements that have the same subject:

```
<rdf:Description rdf:about="http://www.example.org/idx.html">
  <exterms:creationDate>August 16, 1999</exterms:creationDate>
  <dc:language>en</dc:language>
  <dc:creator
    rdf:resource="http://www.example.org/staffid/85740"/>
</rdf:Description>
```
The previous example integrates three statements about the resource http://www.example.org/idx.html by enclosing three property elements in a single rdf:Description tag. The last property element shows how to specify resources (URIrefs) as objects. The rdf:resource attribute of the property element indicates that the property value is a URIref. QNames are not permitted as attribute values, therefore the attribute includes a full URIref in the example statement.

There are several possible ways to represent *blank nodes.* A direct approach is to assign a blank node identifier, which is unknown outside the particular RDF/XML document. The blank node is referred to by the attribute rdf:nodeID instead of rdf:about or rdf:resource.

An optional rdf:datatype attribute of the property element specifies the datatype of a literal, as in this example:

```
<rdf:Description rdf:about="http://www.example.org/idx.html">
  <exterms:creationDate rdf:datatype=
      "http://www.w3.org/2001/XMLSchema#date">1999-08-16
  </exterms:creationDate>
</rdf:Description>
```
Both plain and typed literals may include Unicode characters.

Instead of including full URIrefs in the rdf:about attribute of a subject, rdf:ID can be used together with a fragment identifier. For example, <rdf:Description rdf:ID="item11"> [ .. ] is essentially equivalent to <rdf:Description rdf:about="http://www.example.org/products#item11">. The fragment identifier is interpreted relative to the *base URI,* which by default is the URI of the document itself. Joining the document URI and the fragment identifier with a "#" yields the absolute URIref [112]. It is good practice to specify a base URI in RDF documents; this allows the document to be distributed to different locations on the Web while the full URIs of the resources defined in the document remain unchanged. Similar to namespace information, xml:base is an attribute of the rdf:RDF tag. It defines the base URI, for example xml:base="http://www.example.org/products". By assigning URIrefs to resources the RDF framework provides global identifiers. The descriptions of particular resources need not be included in one single document; it is possible to distribute them throughout the Web.
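The joining rule for rdf:ID values is simple enough to state as code. The following sketch assumes the base URI is given explicitly (as with xml:base); the function name `resolve_rdf_id` is hypothetical.

```python
def resolve_rdf_id(base_uri, rdf_id):
    """Return the absolute URIref for an rdf:ID value.

    The fragment identifier is joined to the base URI with '#'. A fragment
    that is already part of the base URI is dropped first (an assumption of
    this sketch, in line with usual URI resolution).
    """
    base_without_fragment = base_uri.split("#", 1)[0]
    return f"{base_without_fragment}#{rdf_id}"


# With a fixed xml:base the URIref stays the same wherever the file is stored.
BASE = "http://www.example.org/products"
item_uri = resolve_rdf_id(BASE, "item11")
```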

XML entities help to abbreviate even the resource values of attributes. This increases the readability of RDF documents. In the example below a DOCTYPE declaration is added at the beginning of the file - this associates the name xsd with the string following the name inside the ENTITY clause.

```
<!DOCTYPE rdf:RDF [ <!ENTITY xsd
   "http://www.w3.org/2001/XMLSchema#">]>
```
The XML processor replaces the entity reference &xsd; elsewhere in the document with the full namespace URIref. The statement given above now takes this form:

```
<rdf:Description rdf:about="http://www.example.org/idx.html">
  <exterms:creationDate rdf:datatype="&xsd;date">
    1999-08-16
  </exterms:creationDate>
</rdf:Description>
```
So far this section presented the mechanisms to describe individuals in RDF/XML. A very important concept in RDF, however, is to categorize those individuals, i.e. to assign them to a type. The rdf:type property provides this functionality. The following example demonstrates the categorization of a resource, which is also called *instantiation* - the subject resource is declared to be an instance of the object resource. The statement specifies that the item with the relative URI #item11 is of type http://www.example.com/terms/Tent.

```
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:exterms="http://www.example.org/terms/"
     xml:base="http://www.example.org/products">
<rdf:Description rdf:ID="item11">
  <rdf:type rdf:resource="http://www.example.com/terms/Tent"/>
  [ .. other properties]
</rdf:Description>
[ .. ]
```
The definition of classes (like Tent) is not possible in RDF itself, but RDF Schema and OWL provide such capabilities. As the description of type information is very common in RDF, the following abbreviation syntax is a substitute for defining the type with rdf:type explicitly. The QName of the resource that refers to the category replaces the rdf:Description element:

```
<exterms:Tent rdf:ID="item11">
   [ .. other properties]
</exterms:Tent>
```
*Containers* provide a means to group things in RDF models, for example to list the students participating in a course. Containers are resources that contain things (resources or literals). The contained things are called *members.* RDF provides vocabulary for three predefined types of containers, namely *Bag, Sequence* and *Alternative.* Members in bags (rdf:Bag) are not ordered in any way, and bags may contain duplicates. Sequences (rdf:Seq) may also include duplicates, but, as the name suggests, the order of the members is significant. The Alternative container (rdf:Alt) includes a number of alternatives; typically only one of them is chosen by the application processing the data. Bags might be appropriate, for example, to record information about products in a shopping cart, the Sequence container might represent an alphabetically sorted list of students, and the Alternative container is often used to store alternative language translations.

The rdf:type property describes a resource as a container. The members are attached with properties named rdf:\_n, for example rdf:\_1 and rdf:\_2. RDF/XML includes rdf:li as a convenience element, which results in the generation of the corresponding rdf:\_n properties when forming the graph. A snippet representing an example of an *Alternative* container follows:

```
<rdf:Description rdf:about="http://example.org/packages/X11">
  <s:DistributionSite>
    <rdf:Alt>
      <rdf:li rdf:resource="ftp://ftp.example.org"/>
      <rdf:li rdf:resource="ftp://ftp1.example.org"/>
      <rdf:li rdf:resource="ftp://ftp2.example.org"/>
    </rdf:Alt>
  </s:DistributionSite>
</rdf:Description>
```
Statements as in the example do not actually construct a container and its members (as in a programming language); they only describe the elements of a container that presumably already exists.
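The translation of rdf:li elements into numbered membership properties can be sketched as follows. The function name and the blank node label `_:sites` are assumptions of this sketch; the member values are those of the Alternative example.

```python
def container_triples(container, container_type, members):
    """Describe a container: one rdf:type statement plus one rdf:_n statement per member."""
    triples = [(container, "rdf:type", container_type)]
    # rdf:li elements are numbered in document order, starting with rdf:_1.
    for n, member in enumerate(members, start=1):
        triples.append((container, f"rdf:_{n}", member))
    return triples


sites = container_triples(
    "_:sites",
    "rdf:Alt",
    ["ftp://ftp.example.org", "ftp://ftp1.example.org", "ftp://ftp2.example.org"],
)
```

The result is merely a list of descriptive statements about `_:sites`; nothing in it prevents another document from adding an rdf:\_4 statement later.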

Collections are similar to containers in the sense that they facilitate the grouping of resources. In contrast to containers, collections are closed, i.e. they include only the specified set of members. Containers, on the other hand, are open in the sense that anyone can provide additional members to an existing container in an RDF document distributed somewhere on the Web. RDF collections are represented by list structures in RDF graphs: every list node points to one member with rdf:first and to the remainder of the list with rdf:rest, and rdf:nil finally closes the list. The special property attribute rdf:parseType="Collection" indicates that the contents of the element should automatically be interpreted in a way to create the corresponding list structure in the RDF graph. The following RDF fragment exemplifies the usage of collections using the special notation:

```
<rdf:Description rdf:about="http://example.org/courses/6.001">
  <s:students rdf:parseType="Collection">
    <rdf:Description
        rdf:about="http://example.org/students/Amy"/>
    <rdf:Description
        rdf:about="http://example.org/students/Mohamed"/>
    <rdf:Description
        rdf:about="http://example.org/students/Johann"/>
  </s:students>
</rdf:Description>
```
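The list structure that rdf:parseType="Collection" produces in the graph can be sketched in Python. The function name and the blank node labels `_:list1`, `_:list2`, ... are assumptions of this sketch; the rdf:first, rdf:rest and rdf:nil terms are part of the RDF vocabulary.

```python
def collection_triples(members):
    """Return (head, triples) for a closed RDF list holding `members` in order."""
    if not members:
        return "rdf:nil", []
    nodes = [f"_:list{i}" for i in range(1, len(members) + 1)]
    triples = []
    for i, (node, member) in enumerate(zip(nodes, members)):
        # The last node points to rdf:nil, which closes the list.
        rest = nodes[i + 1] if i + 1 < len(nodes) else "rdf:nil"
        triples.append((node, "rdf:first", member))
        triples.append((node, "rdf:rest", rest))
    return nodes[0], triples


head, students = collection_triples([
    "http://example.org/students/Amy",
    "http://example.org/students/Mohamed",
    "http://example.org/students/Johann",
])
```

The course resource would then be linked to `head` with the s:students property. Because the final rdf:rest points to rdf:nil, no further members can be appended without contradicting the existing statements.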
An interesting RDF concept is the so-called *reification.* Sometimes users want to specify metadata about a statement, for example who created the statement or when it was created. RDF provides a vocabulary to describe statements themselves; this is called reification of a statement. The vocabulary includes rdf:Statement, rdf:subject, rdf:predicate and rdf:object. Conventional use of reification comprises the creation of a "reification quad", i.e. four statements as given in the following example:

```
exproducts:triple12345  rdf:type       rdf:Statement .
exproducts:triple12345  rdf:subject    exproducts:item10245 .
exproducts:triple12345  rdf:predicate  exterms:weight .
exproducts:triple12345  rdf:object     "2.4"^^xsd:decimal .
```
The first statement marks the resource as an rdf:Statement; the second, third and fourth describe its subject, predicate and object. Afterwards additional information about the statement, such as the author, can be added. Reification is one of the more complex subjects in RDF, and as the presented work currently does not use it, the interested reader is referred to online resources by the W3C, such as [112], for more details.
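As a brief illustration, the reification quad can be generated mechanically from a statement. The function name `reify` is hypothetical, and the dc:creator statement at the end is an illustrative assumption about the kind of metadata one might attach.

```python
def reify(statement_id, subject, predicate, obj):
    """Return the four statements (the reification quad) describing one triple."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


quad = reify(
    "exproducts:triple12345",
    "exproducts:item10245",
    "exterms:weight",
    '"2.4"^^xsd:decimal',
)
# Metadata about the statement is attached to its identifier (illustrative value).
metadata = [("exproducts:triple12345", "dc:creator", "exstaff:85740")]
```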

After having presented some of the most important concepts related to the Resource Description Framework, the upcoming section presents RDF Schema. RDF Schema provides users with a simple vocabulary to create their own classes and also relations between those classes.

#### **3.2.2 RDF Schema**

RDF Schema (RDFS)<sup>11</sup>, the RDF Vocabulary Description Language 1.0, provides the means to create RDF vocabularies for particular domains, i.e. to specify the relevant elements (classes) and how they relate to each other. RDF Schema defines the metadata used to describe RDF data. Therefore, the RDFS terminology itself is domain-independent, whereas the vocabularies generated with RDFS are typically domain-specific. RDFS provides a *type system* for RDF; the type system is comparable to that of object-oriented programming languages, where *classes* with certain *properties* and instances thereof exist. RDF Schema allows class instantiation and the creation of class hierarchies (sub- and superclasses). In contrast to programming languages, RDFS only describes additional information about resources; it does not force types on data.

The RDF Schema namespace is typically included as

```
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
```
so documents usually refer to the vocabulary with the prefix rdfs. The most basic element of RDF Schema is the *class,* which may be thought of as the category or type of a resource. Those classes may represent almost any kind of thing, be it physical or abstract. The description of classes involves the following resources: rdfs:Class, rdfs:Resource, rdf:type and rdfs:subClassOf. For example, if someone wants to create a vocabulary in the *climate change* domain, then he or she might define the class GreenhouseGas:

ex:GreenhouseGas rdf:type rdfs:Class .

<sup>11</sup>The specification of the RDF vocabulary description language is at http://www.w3.org/TR/rdf-schema, which presents more details about RDFS to the interested reader; http://www.w3.org/TR/rdf-primer/#rdfschema gives an introduction to RDFS.

The rdf:type property specifies instances of classes. Any class in RDF Schema is an instance of rdfs:Class. The statement

ex:Methane rdf:type ex:GreenhouseGas .

creates an instance of the class ex:GreenhouseGas. The rdfs:subClassOf property defines a specialization relation between two classes. For example

```
ex:OilCompany rdfs:subClassOf ex:Company .
```
states that ex:OilCompany is a specialization of ex:Company, which means that any instance of ex:OilCompany is also an instance of ex:Company; software that understands RDF Schema infers this fact. The rdfs:subClassOf property is transitive; therefore, if

ex:OilCompany rdfs:subClassOf ex:Company .

ex:RussianOilCompany rdfs:subClassOf ex:OilCompany .

then ex:RussianOilCompany is also an rdfs:subClassOf ex:Company.
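The inference that RDFS-aware software performs for rdfs:subClassOf can be sketched as a simple graph traversal. The dictionary representation and the function names are assumptions of this sketch, not the implementation used in this work.

```python
def superclasses(cls, subclass_of):
    """Return all direct and indirect superclasses of `cls`."""
    found, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in found:      # also guards against cyclic hierarchies
                found.add(parent)
                stack.append(parent)
    return found


def all_types(direct_types, subclass_of):
    """Extend the asserted types of an instance by every inferred superclass."""
    types = set(direct_types)
    for cls in direct_types:
        types |= superclasses(cls, subclass_of)
    return types


# Direct rdfs:subClassOf statements from the example above.
SUBCLASS_OF = {
    "ex:OilCompany": {"ex:Company"},
    "ex:RussianOilCompany": {"ex:OilCompany"},
}
inferred = superclasses("ex:RussianOilCompany", SUBCLASS_OF)
```

`all_types` covers the second consequence mentioned above: an instance of ex:OilCompany is also an instance of ex:Company.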

*Figure 3.5:* "A Vehicle Class Hierarchy", adopted from [112]

Figure 3.5 shows a class hierarchy in the domain of vehicles. The figure omits the relations of each defined class to rdfs:Class for simplicity. The model defines various classes which represent vehicles, and also demonstrates that a class can be a subclass of multiple other classes. All classes in RDFS are implicitly subclasses of rdfs:Resource. An RDF/XML serialization of the model might be as follows:<sup>12</sup>

```
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xml:base="http://example.org/schemas/vehicles">
<rdfs:Class rdf:ID="MotorVehicle"/>
<rdfs:Class rdf:ID="PassengerVehicle">
  <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Truck">
  <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Van">
  <rdfs:subClassOf rdf:resource="#MotorVehicle"/>
</rdfs:Class>
<rdfs:Class rdf:ID="MiniVan">
  <rdfs:subClassOf rdf:resource="#Van"/>
  <rdfs:subClassOf rdf:resource="#PassengerVehicle"/>
</rdfs:Class>
</rdf:RDF>
```
rdf:ID specifies the vehicle class names; it creates abbreviated URIrefs relative to the base URI of the document and ensures that the names are unique in the document.

Next to the description of classes, RDF Schema also provides the facilities to define the specific properties of those classes. rdf:Property constructs, in combination with the additional RDF Schema elements rdfs:domain, rdfs:range and rdfs:subPropertyOf, describe properties. So every property in RDF is of type rdf:Property:


The rdfs:domain and rdfs:range properties are crucial for the application of semantic validation and inference in the method presented in Chapter 4. rdfs:domain indicates that a given property applies to instances of a particular class. The example

```
ex:study rdfs:domain ex:Person .
```
states that the property ex:study applies to instances of class ex:Person. A property may have zero, one, or more than one rdfs:domain restrictions. If no rdfs:domain is given, nothing is said about which resources the property is applied to. If there is one rdfs:domain stated, then the property applies to instances of that specific class. If multiple rdfs:domain properties are given, then the resources have to be instances of *all* these classes.

Similar to the rdfs:domain property, rdfs:range indicates that the values of the property are instances of a particular class. The statement

ex:study rdfs:range ex:Topic .

declares that the values (objects) of the property ex:study are instances of the class ex:Topic. Like rdfs:domain, a property can have zero, one or more than one rdfs:range descriptions. The remarks given on this subject for rdfs:domain hold analogously for rdfs:range. Next to indicating the class that the values of a property are instances of, rdfs:range can also restrict the value to a *typed literal.* The following statement specifies that the value for the property ex:age is of type xsd:integer:

ex:age rdfs:range xsd:integer .
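Used for inference rather than validation, rdfs:domain and rdfs:range let a reasoner derive rdf:type statements from ordinary statements. The sketch below illustrates this for class-valued ranges only; the function name and the instance data (ex:Alice, ex:ClimateChange) are hypothetical.

```python
def infer_types(triples, domains, ranges):
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    inferred = set()
    for subject, prop, obj in triples:
        # Every declared domain class applies to the subject ...
        for cls in domains.get(prop, ()):
            inferred.add((subject, "rdf:type", cls))
        # ... and every declared range class applies to the object.
        for cls in ranges.get(prop, ()):
            inferred.add((obj, "rdf:type", cls))
    return inferred


DOMAINS = {"ex:study": {"ex:Person"}}
RANGES = {"ex:study": {"ex:Topic"}}
facts = [("ex:Alice", "ex:study", "ex:ClimateChange")]   # hypothetical instance data
types = infer_types(facts, DOMAINS, RANGES)
```

If a property has several domain classes, the loop adds all of them, which corresponds to the rule that the subject is an instance of all stated classes.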

The subsequent listing gives a more extensive example. It illustrates the application of rdfs:domain and rdfs:range together with collections (see Section 3.2.1). The snippet describes the domain and range restrictions for the property *study* as the union of a number of classes defined in another ontology (denoted with the QName prefix cl:). All resources involved in a *study* relation as subject resource are instances of one of the classes cl:Person, cl:Organization, etc., and the values of the relation are instances of cl:AbstractTopic, etc. The next section will introduce the OWL terminology used.

```
<owl:ObjectProperty rdf:ID="study">
    <rdfs:domain>
       <owl:Class>
         <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="cl:Person"/>
            <owl:Class rdf:about="cl:Organization"/>
            <owl:Class rdf:about="cl:Unknown"/>
         </owl:unionOf>
       </owl:Class>
    </rdfs:domain>
    <rdfs:range>
       <owl:Class>
         <owl:unionOf rdf:parseType="Collection">
            <owl:Class rdf:about="cl:ObjectTopic"/>
            <owl:Class rdf:about="cl:AbstractTopic"/>
            <owl:Class rdf:about="cl:Unknown"/>
         </owl:unionOf>
       </owl:Class>
    </rdfs:range>
</owl:ObjectProperty>
```
The architecture applied in the present work describes instances, which are represented with their corresponding DBpedia entry in this example, with statements such as:


Similar to classes, the rdfs:subPropertyOf element provides support for the specialization of properties. The construct indicates that whenever two resources are related by a property, they are also related by every (parent) property of which that property is declared a *subPropertyOf.*
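A reasoner can apply this rule by adding, for every statement, the corresponding statement with each parent property. The following sketch assumes a hypothetical subproperty ex:studyPartTime of ex:study; neither that property nor the function name comes from this work.

```python
def with_superproperties(triples, subproperty_of):
    """Add the statements implied by rdfs:subPropertyOf until nothing new appears."""
    result = set(triples)
    changed = True
    while changed:
        changed = False
        for subject, prop, obj in list(result):
            for parent in subproperty_of.get(prop, ()):
                if (subject, parent, obj) not in result:
                    result.add((subject, parent, obj))
                    changed = True
    return result


SUBPROPERTY_OF = {"ex:studyPartTime": {"ex:study"}}     # hypothetical declaration
closed = with_superproperties(
    {("ex:Alice", "ex:studyPartTime", "ex:ClimateChange")},
    SUBPROPERTY_OF,
)
```

Repeating the loop until no statement is added also handles chains of subproperties.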

The domain and range properties presented in this section - in contrast to the use of properties in programming languages - are independent of the class they are described for. This means that there is usually only one property with a specific name (for example *study)* defined in the domain, or even independent of a domain. In RDFS it is not possible to redefine or restrict properties locally. The application determines the way that properties are interpreted: it can use a property as a constraint, or rather as some kind of additional description which helps to infer statements.

There are a few more built-in properties in RDFS, which are intended for the documentation of RDF schemas and of resources. Those properties include, among others, rdfs:label and rdfs:comment. rdfs:label optionally specifies a human-readable label for a resource and rdfs:comment gives a (human-readable) description of a resource.

RDF Schema provides some simple facilities to define typed concepts, to create taxonomies among these classes, and to relate them to each other with properties. The next section on OWL will supply additional terminology which is needed to create ontologies that are more expressive.

#### **3.2.3 Web Ontology Language**

OWL supports the definition and instantiation of Web ontologies. OWL provides a more expressive vocabulary than RDF and RDF Schema, along with a formal semantics. OWL goes far beyond those languages regarding the ability to express machine-interpretable content and meaning. In contrast to RDF Schema, OWL includes the semantics needed to do useful reasoning tasks [173]. Many mature tools are available, e.g. for reasoning or ontology editing. Historically, OWL builds on the DAML+OIL ontology language, but it presents a revised and enhanced design<sup>13</sup>. The W3C provides a set of six documents<sup>14</sup> that describe OWL, ranging from a simple introduction to a formally stated normative language definition and to use cases. The present description of OWL (with a focus on the most important features for the present work) is largely based on the "OWL Guide"<sup>15</sup>, which is the second of the six documents [173].

There are three different sublanguages of OWL: OWL Lite, OWL DL and OWL Full. The sublanguages are increasingly expressive, with the drawback of higher computational complexity. The three species are designed for specific communities and use cases. *OWL Lite* supports users with the capability to describe taxonomies and rather simple constraints. Cardinality constraints, for example, are restricted to 0 or 1. The benefit of OWL Lite is the lowest formal complexity among the sublanguages. *OWL DL* is for users that need maximal expressiveness while not losing computational completeness - OWL DL computations are guaranteed to finish in finite time. OWL DL supports all constructs of the language, but certain restrictions are mandatory, for example a class must not be an instance of another class. The name OWL DL stems from its correspondence with description logics. *OWL Full* yields maximal syntactical freedom but no computational guarantees. In OWL Full a class may be treated as a collection of individuals and simultaneously as an individual. OWL Full also allows the augmentation of predefined terminology from RDF and OWL, although it is not likely that OWL reasoners will support every feature of OWL Full. As each sublanguage is an extension of its predecessor, all legal statements and conclusions from the predecessor are also legal in the more expressive sublanguage(s). The three sublanguages help users to choose the best-fitting variant according to their needs: For example, if a user does not need the more expressive restriction constructs of OWL DL, he or she should resort to OWL Lite for its desirable computational properties. All OWL documents are RDF documents, and every RDF document is also an OWL Full document, as OWL Full can be viewed as an extension of RDF [173]. OWL Lite and OWL DL, on the other hand, are extensions of a restricted view on RDF, so only some RDF documents are legal OWL Lite or OWL DL documents.

<sup>13</sup>http://www.daml.org

<sup>14</sup>http://www.w3.org/TR/2004/REC-owl-features-20040210/#s1.1

<sup>15</sup>http://www.w3.org/TR/owl-guide

The design of OWL is geared towards the distributed and open Web environment. OWL ontologies can import other ontologies from the Web, and distributed sources are allowed to extend existing ontologies by adding new facts, but those sources can never delete statements. As already mentioned, OWL ontologies are RDF documents. The listing below shows the structure of a typical OWL document. The snippets are extracted from a handcrafted ontology (denoted as *classification ontology,* used in Chapter 4), which was built in Protégé<sup>16</sup>.

```
<?xml version="1.0"?> 
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:protege=
    "http://protege.stanford.edu/plugins/owl/protege#"
    xmlns:xsp="http://www.owl-ontologies.com/2005/08/07/xsp.owl#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:swrl="http://www.w3.org/2003/11/swrl#"
    xmlns:swrlb="http://www.w3.org/2003/11/swrlb#"
    xmlns="http://weblyzard.net/rel-det-2009/rd#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xml:base="http://weblyzard.net/rel-det-2009/rd">
  <owl:Ontology rdf:about="">
    <rdfs:comment
        rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
  Relation Detection Classification Ontology</rdfs:comment>
    <rdfs:label
        rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
  Classification Ontology</rdfs:label>
  </owl:Ontology>
  <owl:Class rdf:ID="Person"/>
  <owl:Class rdf:ID="ObjectTopic">
[ ... ]
</rdf:RDF>
```
<sup>16</sup>http://protege.stanford.edu

The example OWL file starts with an rdf:RDF tag to mark the content as RDF. Several XML namespace declarations make the rest of the document more readable, and an xml:base attribute defines the base URI. XML ENTITY definitions in a DOCTYPE declaration preceding the rdf:RDF tag might add further abbreviation definitions. Section 3.2.1 gives more details on abbreviation declarations. The owl:Ontology tag groups the ontology headers, which are specific for the OWL document. This set of statements collects metadata about the ontology. An rdf:about attribute states a name or reference for the ontology; if its value is empty, as in the example, the base URI is used as the name. Among the common properties of the owl:Ontology element are rdfs:comment and rdfs:label, which add comments and natural language labels. The metadata section may include version information about the ontology, such as the current version (with the owl:versionInfo property), the prior version, and compatibility information regarding prior versions. owl:imports provides the mechanism to import other ontologies; this brings the entire set of assertions in that ontology into the current one. In addition, importing another ontology recursively includes the ontologies which that ontology imports. Finally, and most importantly, the declarations of OWL classes, properties and instances follow, which the rest of this section focuses upon.

Most of the power in ontological reasoning comes from class-based reasoning. The most basic class is the built-in owl:Thing. All individuals are members of this class, and all user-defined classes are (implicitly) subclasses of owl:Thing. On the other side of the spectrum is owl:Nothing, which contains no members, is the most specific class, and is implicitly a subclass of all other classes. Each *class* is defined as an instance of owl:Class, and the attribute rdf:ID specifies the name of the resource, for example:

```
<owl:Class rdf:ID="Person"/>
<owl:Class rdf:ID="ObjectTopic"/>
```
The reference with an rdf:about attribute allows the extension of an existing class specification, which is critical in distributed ontology definitions.

The rdfs:subClassOf property is the fundamental constructor for building *taxonomies.* It is a transitive property. The subsequent snippet defines the class *ObjectTopic* as a subclass of *Topic:*

```
<owl:Class rdf:ID="ObjectTopic">
  <rdfs:subClassOf rdf:resource="#Topic"/>
</owl:Class>
```
A class description includes two parts: the introduction of a name and a list of restrictions. It is important to note that the class definition restricts the instances of a class. The instances satisfy all restrictions, i.e. instances belong to the intersection of the restrictions.

*Individuals* (instances), the members of classes, are assigned to a class with statements such as the subsequent one, which describes *NASA* as an instance of *Organization.* 

```
<cl:Organization rdf:ID="NASA"/>
```
The distinction between classes and individuals is sometimes a challenging question, as it is not always clear if an object should be modeled as a class or as an individual. A class is basically a collection of properties that describe a set of individuals. Therefore, classes should correspond to naturally occurring sets of things in a particular domain. In contrast individuals correspond to actual entities. Subclasses refer to subsets of members of the parent class, and instances are incarnations of those members. The ontology engineer decides upon the conceptualization based on the level of representation and granularity of the domain specification.

Besides the building of taxonomies, properties provide the means to state general facts about the members of classes and specific facts about individuals. OWL distinguishes two types of properties: datatype properties and object properties. Datatype properties describe relations between instances of classes and literals. OWL recommends the use of RDF literals or simple XML Schema datatypes. Object properties specify relations between instances of two classes.

OWL ontologies frequently use domain and range restrictions defined in the RDF Schema vocabulary, for example:

```
<owl:ObjectProperty rdf:ID="madeFromGrape">
  <rdfs:domain rdf:resource="#Wine"/>
  <rdfs:range rdf:resource="#WineGrape"/>
</owl:ObjectProperty>
```
If an ontology defines multiple restrictions, as in the listing above, this implicitly represents a conjunction, i.e. the property has a domain of *X* as well as a range of *Y.* If multiple domains are defined, then the actual domain is the intersection of all these restrictions. In contrast to programming languages, where type definitions serve for type checks, a reasoner uses this information to infer the type of individuals.

The definition of *property characteristics* provides mechanisms for enhanced reasoning support for properties. Among those characteristics is transitivity. The following snippet from the OWL Guide [173] describes the *locatedIn* property as transitive; a reasoner can therefore deduce that #SantaCruzMountainsRegion is located in #USRegion.

```
<owl:ObjectProperty rdf:ID="locatedIn">
  <rdf:type rdf:resource="&owl;TransitiveProperty" />
  <rdfs:domain rdf:resource="&owl;Thing" />
  <rdfs:range rdf:resource="#Region" />
</owl:ObjectProperty>
<Region rdf:ID="SantaCruzMountainsRegion">
  <locatedIn rdf:resource="#CaliforniaRegion" />
</Region>
<Region rdf:ID="CaliforniaRegion">
  <locatedIn rdf:resource="#USRegion" />
</Region>
```
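The deduction a reasoner draws from a transitive property can be sketched as a fixpoint computation. The following Python code is a naive illustration under that assumption, not the algorithm of any particular reasoner; the function name `transitive_closure` is hypothetical.

```python
def transitive_closure(pairs):
    """Return all (x, z) pairs connected by a chain of property statements."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # P(a, b) and P(b, d) imply P(a, d) for a transitive P.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# The two asserted locatedIn statements from the listing above.
located_in = {
    ("SantaCruzMountainsRegion", "CaliforniaRegion"),
    ("CaliforniaRegion", "USRegion"),
}

inferred = transitive_closure(located_in) - located_in
print(inferred)
```

The only statement that is derived but not asserted links `SantaCruzMountainsRegion` to `USRegion`.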
The *SymmetricProperty* entity specifies a property as symmetric, i.e. for any *x* and *y*: *P(x, y)* iff *P(y, x)*. An example of this construct is the property *adjacentRegion*. *FunctionalProperty* is appropriate if an individual is associated with at most one value for the particular property. The *hasBirthMother* property would be a typical example, as any individual has a unique mother. The **owl:inverseOf** element states that for all individuals *x* and *y*: *P1(x, y)* iff *P2(y, x)*, i.e. the property *P1* is the inverse of *P2*. In a wine ontology the properties *hasMaker* and *producesWine* typically have this characteristic.

Property restrictions further constrain the application of properties in specific contexts in a variety of ways. All property restrictions are defined in OWL within the context of an **owl:Restriction** element. An **rdfs:subClassOf** clause encloses the restriction description, which defines an unnamed class that represents the set of things satisfying the restriction. An **owl:onProperty** element specifies the property to be constrained. Property restrictions are also called local restrictions, as they do not apply to all individuals but are local to their containing class definition. The **owl:allValuesFrom** restriction implies that for every instance of a class the values (objects) of the respective property are members of the class indicated by the **owl:allValuesFrom** clause. The following example makes this idea more vivid:

```
<owl:Class rdf:ID="Wine">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMaker" />
      <owl:allValuesFrom rdf:resource="#Winery" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
```
The statements above restrict the **hasMaker** property for the class *Wine* to members of the class *Winery*, i.e. the maker of a *Wine* must be a *Winery*.

The **owl:someValuesFrom** restriction is similar: for every instance of the class, at least *one* value of the property must be a member of the class given as the value of *someValuesFrom*.

Another type of property restriction is the *cardinality* constraint. **owl:cardinality** specifies the exact number of elements in a relation. The following example states that every vintage has exactly one vintage year:

```
<owl:Class rdf:ID="Vintage">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasVintageYear"/>
      <owl:cardinality
          rdf:datatype="&xsd;nonNegativeInteger">1</owl:cardinality>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
```
In OWL Lite, cardinality expressions are limited to the values 0 and 1. OWL DL allows all non-negative integer values. The properties **owl:minCardinality** and **owl:maxCardinality** specify lower and upper bounds when a range rather than an exact number is required.

OWL supports the mapping of ontologies at the level of classes, properties and individuals. Mapping and merging ontologies is an important task, as ontologies should be widely shared and reused in order to have maximal impact, and to avoid the cumbersome task of building ontologies from scratch. The **owl:equivalentClass** tag indicates that two classes have exactly the same members.

```
<owl:Class rdf:ID="Wine">
  <owl:equivalentClass rdf:resource="&vin;Wine"/>
</owl:Class>
```
Two individuals are declared identical with **owl:sameAs**. The property is commonly applied to state that individuals described in different documents are actually the same. The following example from the DBpedia page about "Fossil fuel" links the resource to a corresponding resource in Freebase.com:

```
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Fossil_fuel">
  <owl:sameAs xmlns:owl="http://www.w3.org/2002/07/owl#"
      rdf:resource=
  "http://rdf.freebase.com/ns/guid.9202a8c0641f80dd"/>
</rdf:Description>
```
On the contrary, **owl:differentFrom** states that values are mutually distinct.

```
<WineSugar rdf:ID="Dry" />
<WineSugar rdf:ID="Sweet">
  <owl:differentFrom rdf:resource="#Dry" />
</WineSugar>
```
When combined with a cardinality restriction that a wine has only one *hasSugar* relation, these statements prevent a wine from being described as both dry and sweet.
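The interplay of the cardinality restriction and **owl:differentFrom** can be sketched as a consistency check. The Python code below is a simplified illustration, not a reasoner: it assumes a set of properties limited to one value per subject and reports subjects holding two values that are declared different. The wines `WineA` and `WineB` and the function name `cardinality_clashes` are hypothetical.

```python
from collections import defaultdict


def cardinality_clashes(triples, single_valued, different_from):
    """Find subjects with two mutually distinct values for a single-valued property."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p in single_valued:
            values[(s, p)].add(o)

    clashes = []
    for (s, p), objs in values.items():
        ordered = sorted(objs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                # Only values explicitly declared different produce a contradiction.
                if frozenset((a, b)) in different_from:
                    clashes.append((s, p, a, b))
    return clashes


different_from = {frozenset(("Dry", "Sweet"))}
triples = [
    ("WineA", "hasSugar", "Dry"),
    ("WineA", "hasSugar", "Sweet"),  # conflicts with the restriction
    ("WineB", "hasSugar", "Dry"),
]

clashes = cardinality_clashes(triples, {"hasSugar"}, different_from)
print(clashes)
```

Without the `differentFrom` declaration no contradiction arises: since OWL makes no unique name assumption, a reasoner would instead conclude that the two values denote the same individual.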

The **owl:AllDifferent** element, combined with **owl:distinctMembers**, offers a more convenient way to define distinct members than stating that resources are pairwise distinct [173].

```
<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <vin:WineColor rdf:about="#Red" />
    <vin:WineColor rdf:about="#White" />
    <vin:WineColor rdf:about="#Rose" />
  </owl:distinctMembers>
</owl:AllDifferent>
```
OWL provides additional constructs for class creation in the form of *class expressions*. Basic set operations, enumerations, or the explicit statement of contained individuals support the generation of complex classes. Set operations include the OWL constructs intersectionOf, unionOf and complementOf, all of which are applied to owl:Class constructs. An example of an intersection defines *Burgundy* as wines that have at least one *locatedIn* property with the value *BourgogneRegion*.

```
<owl:Class rdf:about="#Burgundy">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine" />
    <owl:Restriction>
      <owl:onProperty rdf:resource="#locatedIn" />
      <owl:hasValue rdf:resource="#BourgogneRegion" />
    </owl:Restriction>
  </owl:intersectionOf>
</owl:Class>
```
The definition of union constructs is usually a little simpler, as shown in this self-explanatory example:

```
<owl:Class rdf:ID="Fruit">
  <owl:unionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#SweetFruit" />
    <owl:Class rdf:about="#NonSweetFruit" />
  </owl:unionOf>
</owl:Class>
```

A fragment from the classification ontology (see Chapter 4) gives an example of defining the domain restrictions for the property *study* using the **owl:unionOf** element.

```
<owl:ObjectProperty rdf:ID="study">
    <rdfs:domain>
      <owl:Class>
        <owl:unionOf rdf:parseType="Collection">
           <owl:Class rdf:about="#Person"/>
           <owl:Class rdf:about="#Organization"/>
           <owl:Class rdf:about="#Unknown"/>
         </owl:unionOf>
      </owl:Class>
    </rdfs:domain>
</owl:ObjectProperty>
```
Finally, as the last of the mentioned set operations, the construct **owl:complementOf** selects all individuals from the domain that are not members of a specified class.

OWL also provides the means to explicitly list the members of a class with the **owl:oneOf** construct. **owl:oneOf** completely specifies the class members; no other members may be added to the class afterwards. The following example (from the OWL Guide [173]) defines the class *WineColor* as an enumeration of the individuals White, Rose and Red.

```
<owl:Class rdf:ID="WineColor">
  <rdfs:subClassOf rdf:resource="#WineDescriptor"/>
  <owl:oneOf rdf:parseType="Collection">
    <owl:Thing rdf:about="#White"/>
    <owl:Thing rdf:about="#Rose"/>
    <owl:Thing rdf:about="#Red"/>
  </owl:oneOf>
</owl:Class>
```
The **owl:disjointWith** construct states that a member of a given class cannot also be a member of any of the classes listed as values of the *disjointWith* property.

## **3.3 Querying and Reasoning**

The mechanisms introduced in the previous sections about representation languages for ontologies are not sufficient to leverage the full potential of the Semantic Web. Besides defining vocabularies and statements, it is necessary to have techniques and tools to query the datasets, as well as to have support for reasoning on semantic data. Section 3.3.1 discusses the SPARQL and RDQL query languages for Semantic Web data; Sections 3.3.2 and 3.3.3 briefly introduce the Jena toolkit and the Redland RDF libraries. Those frameworks support a broad range of features, among which are parsing RDF data, building graph models, querying and reasoning.

### **3.3.1 SPARQL and RDQL**

RDF graphs are kept in RDF stores, also called triple stores. A triple store is in some respects similar to relational databases or XML stores. As in database management systems, query languages are the typical means to access triple stores. This section gives an overview of SPARQL, which is the W3C standard for RDF query languages, and also touches on RDQL, a predecessor of SPARQL.

#### **SPARQL**

In 2008 the W3C made its standardized query language SPARQL a W3C Recommendation. SPARQL's name is a recursive acronym which stands for SPARQL Protocol and RDF Query Language. SPARQL is a successor of query languages such as RDF Query Language and RDQL. SPARQL queries are centered around patterns which are matched against an RDF graph. Those graph patterns are constructed from the most basic element, the triple pattern. A triple pattern looks similar to an RDF triple as presented in Section 3.2.1, but variables replace some of the elements (subject, predicate, object) in the triple. The `?` (or `$`) symbol is prepended to variable names. A few examples of such triple patterns follow:

```
?a rdf:type dbpedia:Organization .
<http://dbpedia.org/resource/Al_Gore> owl:sameAs ?c .
```
The patterns are to be read as: Which resources in the graph are of type **dbpedia:Organization**? Which objects are marked as being the *same as* the DBpedia resource for Al Gore? The syntax of triple patterns is very simple: subject, predicate, object and finally a dot. A query engine returns the entities that match the given query pattern, either in a table format or as a resulting RDF graph. A set of triple patterns makes up a *graph pattern*; SPARQL uses braces to enclose this list of triple patterns. It is important to note that a variable that appears in two or more triple patterns has to match the same resource in the graph. Examples of graph patterns are:

```
{ ?a rdf:type dbpedia:Organization . }
```

```
{ ?a rdf:type ex:Person .
  ?a ex:born ?c .
  ?c geo:isIn geo:Austria . }
```
The first graph pattern extracts all **dbpedia:Organization** organizations and the date when they were formed. The second pattern selects resources of type ex:Person together with the locations where they were born, provided the location lies inside geo:Austria. In graph patterns *all* of the triple patterns must match, and every occurrence of a variable must match the same resource [8]. The **UNION** operator combines alternative graph patterns, as shown below:

```
{ { ?a rdf:type dbpedia:Organization . }
  UNION
  { ?a rdf:type dbpedia:Person . } }
```
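How a graph pattern is matched can be sketched in a few lines of Python. The code below is a naive illustration of the matching semantics described above (conjunction of triple patterns, consistent variable bindings, UNION as a combination of alternatives); real SPARQL engines use far more efficient join strategies. The sample graph and all function names are made up for illustration.

```python
def is_var(term):
    return term.startswith("?")


def match_triple(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must match the same resource at every occurrence.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def match_graph_pattern(patterns, graph):
    """Return all bindings under which every triple pattern matches."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                candidate = match_triple(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def match_union(alternatives, graph):
    """UNION: collect the solutions of each alternative graph pattern."""
    return [b for patterns in alternatives
            for b in match_graph_pattern(patterns, graph)]


# Hypothetical sample data.
graph = [
    ("ex:anna", "rdf:type", "ex:Person"),
    ("ex:anna", "ex:born", "ex:Vienna"),
    ("ex:Vienna", "geo:isIn", "geo:Austria"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:born", "ex:Paris"),
    ("ex:acme", "rdf:type", "dbpedia:Organization"),
]

born_in_austria = match_graph_pattern(
    [("?a", "rdf:type", "ex:Person"),
     ("?a", "ex:born", "?c"),
     ("?c", "geo:isIn", "geo:Austria")],
    graph,
)

persons_or_orgs = match_union(
    [[("?a", "rdf:type", "dbpedia:Organization")],
     [("?a", "rdf:type", "ex:Person")]],
    graph,
)
```

In this sample only `ex:anna` satisfies all three triple patterns, whereas the UNION query returns the organization as well as both persons.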
SPARQL supports different result forms: a query can simply list the bindings for its variables, or return the complete subgraph of statements matching the query. The **SELECT** form returns the binding list as a table of values corresponding to the variables. The following example query extracts the name and email address (specified with the FOAF vocabulary) of resources from an RDF model:

```
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox
WHERE
  { ?x foaf:name ?name .
    ?x foaf:mbox ?mbox }
```
The PREFIX keyword in the first line associates a label (*foaf*) with an IRI (<http://xmlns.com/foaf/0.1/>); the colon concatenates the prefix name and the local name. The prefix name or the local part may stay empty. The prefixes are similar to the QNames presented in Section 3.2.1. The **SELECT** clause lists the variables to appear in the query results. In the given example the variables *name* and *mbox* appear in the results, but not the variable *x*. The **WHERE** clause includes the graph pattern matched against the data.

The result of the query is a solution sequence, with zero, one, or multiple solutions to the graph pattern. Table 3.2 gives the resulting sequence for the previous query:

*Table 3.2:* Query result for a **SELECT** query [135]

In addition to resources, RDF graphs also include literals, and SPARQL supports plain and typed literals in queries. The first query in the example below looks for triples with the literal "cat" as object, whereas the second query shows the use of arbitrary datatypes:

```
SELECT ?v WHERE { ?v ?p "cat" }
SELECT ?v WHERE { ?v ?p
    "abc"^^<http://example.org/datatype#specialDatatype> }
```
As an alternative to the SELECT form, SPARQL supports the CONSTRUCT form. A CONSTRUCT query produces a new graph from the solutions matching the input pattern. Prud'hommeaux and Seaborne [135] give an example of the CONSTRUCT query form. CONSTRUCT returns a number of RDF triples, which can be serialized to RDF/XML.

Another important capability of SPARQL is term constraints, i.e. FILTER constructs which restrict solutions to those for which the filter expression evaluates to *true*. Filters are included in the graph patterns, and filter functions like **regex** operate on RDF literals:

```
SELECT ?title
WHERE { ?x dc:title ?title .
        FILTER regex(?title, "^SPARQL") }
```
The FILTER construct also applies to arithmetic expressions:

```
SELECT ?title ?price
WHERE { ?x ns:price ?price .
        FILTER (?price < 30.5)
        ?x dc:title ?title . }
```
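Conceptually, a FILTER removes those solutions from the solution sequence for which the expression is not true. The following Python sketch mimics the two filters above on a hand-written solution sequence; the titles and prices are made-up sample values, and `apply_filter` is a hypothetical helper rather than part of any SPARQL library.

```python
import re

# Hypothetical solution sequence: one dictionary of bindings per solution.
solutions = [
    {"?title": "SPARQL Tutorial", "?price": 42.0},
    {"?title": "The Semantic Web", "?price": 23.0},
]


def apply_filter(solutions, predicate):
    """Keep only the solutions for which the filter expression holds."""
    return [s for s in solutions if predicate(s)]


# Counterpart of FILTER regex(?title, "^SPARQL")
starts_with_sparql = apply_filter(
    solutions, lambda s: re.search("^SPARQL", s["?title"]) is not None
)

# Counterpart of FILTER (?price < 30.5)
cheap = apply_filter(solutions, lambda s: s["?price"] < 30.5)

print([s["?title"] for s in starts_with_sparql])
print([s["?title"] for s in cheap])
```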
For more information about other *term constraints* the interested reader is referred to the W3C Recommendation on SPARQL [135]. The document includes many other SPARQL features not mentioned in this brief introduction, such as the handling of RDF constructs (blank nodes, RDF collections, etc.), more details on graph patterns and filtering, optional pattern matching, modifications to solution sequences, and the **ASK** and **DESCRIBE** query forms.

#### **RDQL**

This section introduces the reader to some facts about RDF Data Query Language (RDQL), because the present work relies on RDQL for some querying tasks in connection with the Redland framework (information about Redland follows in Section 3.3.3). RDQL, like SPARQL, allows the extraction of information from RDF graphs. The basic constructs used to achieve this goal are graph patterns. The syntax differs in some points from SPARQL, and the SPARQL language is more expressive than RDQL. Some of the features missing in RDQL are: the sorting of results, the ability to add optional information to query results, expressive testing (RDQL only has crude support for datatypes) and named graphs. In order to give an impression of RDQL and its syntax, the current section presents a few example queries. The present work used RDQL together with the RDF Query Library, which Redland builds upon for its RDF querying facilities.

`SELECT ?a ?b WHERE (?a ?b dbpedia:Person) USING dbpedia FOR <http://dbpedia.org/ontology/>`

The query presented above selects all subjects and predicates from an RDF model which have dbpedia:Person as their object. RDQL is quite similar to SPARQL in the way it uses graph patterns and variables. A major difference concerns the specification of namespace prefix declarations. In contrast to SPARQL, RDQL declares such prefixes with the **USING** keyword as part of the query:

```
SELECT ?resource
WHERE (?resource info:age ?age)
AND ?age >= 24
USING info FOR <http://example.org/peopleInfo#>
```
The example also demonstrates another difference to SPARQL. Instead of the FILTER construct, RDQL applies an **AND** clause.

### **3.3.2 Reasoning with Jena**

Jena<sup>17</sup> is a Java-based framework for building Semantic Web applications. It provides programmatic support for W3C's Semantic Web recommendations RDF, RDFS, OWL and SPARQL, as well as for its own query language RDQL, and also includes rule-based inferencing. Jena is free software (open source), and originates from work at HP Labs' Semantic Web Programme.<sup>18</sup>

<sup>17</sup>http://jena.sourceforge.net

<sup>18</sup>http://www.hpl.hp.com/semweb

At the heart of the Jena RDF toolkit is the RDF graph. Jena perceives reasoning support for RDFS and OWL as graph-to-graph transformations, which produce graphs of virtual triples [29]. Jena includes rich APIs for building models and for handling RDFS and OWL ontologies. The first release of Jena [116] was in 2000; the Jena2 series started in 2003. Besides its model API for the manipulation of RDF graphs, Jena provides an RDF/XML parser, as well as I/O modules for N3, N-Triples and RDF/XML. The framework provides modules to store graphs in memory, or persistently in native persistence engines or in relational databases.

The present work uses Jena to create inferred models from input OWL ontologies such as the DBpedia ontology or the OpenCyc ontology. Those models are saved persistently to a PostgreSQL database. The inferred models include the original statements from the input ontology and the inferred statements. The Jena2 inference subsystem allows various inference engines or reasoners to be plugged into Jena. The inference mechanism permits the application of languages such as RDFS and OWL to create additional facts from instance data and class descriptions. The mechanism is quite general, though: it is based on a generic rule engine which can be applied to many RDF processing and transformation tasks.
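The notion of an inferred model, which contains the original statements plus the derived ones, can be sketched independently of Jena. The Python code below does not use the Jena API; it is a minimal forward-chaining loop that applies only two `rdfs:subClassOf` rules until no new statements appear. The classes and the individual `ex:acme` are hypothetical, and the function name `materialize` is an assumption made for this illustration.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def materialize(asserted):
    """Return the asserted statements plus those implied by rdfs:subClassOf."""
    model = set(asserted)
    while True:
        subclass = {(s, o) for s, p, o in model if p == SUBCLASS}
        derived = set()
        for s, p, o in model:
            for c, d in subclass:
                # (x type C), (C subClassOf D)       => (x type D)
                # (B subClassOf C), (C subClassOf D) => (B subClassOf D)
                if o == c and p in (TYPE, SUBCLASS):
                    derived.add((s, p, d))
        if derived <= model:
            return model  # fixpoint reached, nothing new was derived
        model |= derived


# Hypothetical input ontology and instance data.
asserted = {
    ("ex:Winery", SUBCLASS, "ex:Company"),
    ("ex:Company", SUBCLASS, "ex:Organization"),
    ("ex:acme", TYPE, "ex:Winery"),
}

inferred_model = materialize(asserted)
for statement in sorted(inferred_model - asserted):
    print(statement)
```

A rule engine such as the one described above generalizes this idea: the rules are data rather than hard-coded conditions, so the same loop can serve RDFS, OWL or custom transformation tasks.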

### **3.3.3 Redland**

Redland<sup>19</sup> is a set of free libraries written in C. Redland allows for storage, querying and manipulation of RDF models [12]. It is designed to be flexible and modular. Furthermore, it aims at portability and computational performance. Similar to Jena, Redland provides mechanisms to store RDF models in memory, or persistently in databases, triple stores, as well as in files. The major building blocks of Redland are four libraries:


SPARQL query languages, and also LAQRS, which is an experimental set of syntax extensions for SPARQL.


The framework built for the present thesis makes use of Redland for processing DBpedia and Freebase resources, i.e. to build RDF models and execute queries in the RDQL and SPARQL languages. Besides its rich set of features, Redland was chosen for its simplicity of use in combination with the Python programming language.

## **3.4 Public Datasets and Ontologies**

This section introduces the external datasets which were used in the course of the present work, and also two ontologies linked to the datasets: the OpenCyc and the DBpedia ontology. We focused on the DBpedia dataset; information from the Freebase dataset complements the DBpedia statements. These two sources provide structured information on a wealth of cross-domain topics. DBpedia provides structured data extracted from Wikipedia and covers over two million "things". DBpedia is heavily interlinked with other datasets from the Linking Open Data project; most relevant for the present work are outgoing links to Freebase and to concepts from the OpenCyc ontology. The presented method maps concept labels to entries in DBpedia and then tries to infer a *concept type* according to a given set of types with the help of a number of heuristics and ontology reasoning.

### **3.4.1 DBpedia**

Over the last years, Wikipedia evolved into one of the central knowledge sources of mankind, and in contrast to traditional encyclopedias, it is a community-based project maintained and constantly enhanced by thousands of voluntary contributors around the globe. DBpedia leverages this comprehensive source of knowledge: it extracts structured information from Wikipedia and then provides the information on the Web [18]. Bizer et al. [18] describe the size of DBpedia: "The resulting DBpedia knowledge base currently describes more than 2.6 million entities, including 198,000 persons, 328,000 places, 101,000 musical works, 34,000 films, and 20,000 companies. The knowledge base contains 3.1 million links to external Web pages; and 4.9 million RDF links into other Web data sources." The characteristics of information in Wikipedia and therefore in DBpedia are, next to the sheer size, that it covers many domains and builds on real community agreement on the topics discussed. Other advantages are true multilingualism and the automatic evolution of DBpedia along with changes in Wikipedia.

Bizer et al. [18] list three major contributions of DBpedia: its extraction framework that builds the knowledge base, the provision of Web-dereferenceable identifiers for entities, and the linkage between DBpedia and other data sources. The remainder of this section covers those contributions in more detail.

#### **Extraction Framework**

The information extraction framework aims at building a rich multi-domain knowledge base from Wikipedia content. Besides free text, Wikipedia includes structured information in the form of infoboxes, as well as categorization information, images, links to external resources, redirects, disambiguation pages, etc. DBpedia builds on this structured information to generate its knowledge base. A number of extractor components, geared towards specific Wikipedia structures, accomplish the actual extraction task. This process results in triple data about the corresponding resource. Bizer et al. [18] present the details of the extraction architecture. The following RDF/XML snippet from the DBpedia page on "Al Gore" gives an impression of a typical DBpedia resource:

```
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <rdfs:label xml:lang="en">Al Gore</rdfs:label>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <rdfs:comment xml:lang="en">Albert Arnold Gore, Jr. (born
   March 31, 1948) is an American environmental activist,
   author, businessperson, former politician, and former
   journalist. He served as the forty-fifth Vice President of
   the United States from 1993 to 2001 under President Bill
   Clinton.</rdfs:comment>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <dbpprop:hasPhotoCollection
   xmlns:dbpprop="http://dbpedia.org/property/" rdf:resource=
   "http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Al_Gore"
    />
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <foaf:depiction xmlns:foaf="http://xmlns.com/foaf/0.1/"
   rdf:resource=
   "http://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Al_Gore.jpg/200px-Al_Gore.jpg"/>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <dbpprop:birthPlace
   xmlns:dbpprop="http://dbpedia.org/property/" rdf:resource=
   "http://dbpedia.org/resource/Washington%2C_D.C."/>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <dbpprop:religion
   xmlns:dbpprop="http://dbpedia.org/property/" rdf:resource=
   "http://dbpedia.org/resource/Baptist"/>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <dbpedia-owl:spouse
   xmlns:dbpedia-owl="http://dbpedia.org/ontology/"
   rdf:resource="http://dbpedia.org/resource/Tipper_Gore"/>
</rdf:Description>
<rdf:Description
   rdf:about="http://dbpedia.org/resource/Al_Gore">
   <skos:subject
   xmlns:skos="http://www.w3.org/2004/02/skos/core#"
   rdf:resource=
   "http://dbpedia.org/resource/Category:Green_thinkers"/>
</rdf:Description>
</rdf:RDF>
```
The rdfs:label information corresponds to the title of the Wikipedia page; the abstract (first paragraph) is used as rdfs:comment. DBpedia provides information such as rdfs:label or rdfs:comment in a large number of different languages, if available. Wikipedia infoboxes yield statements about birth date and place, religion, etc. The category information ("Green Thinkers") is extracted from Wikipedia's categorization structure. The complete DBpedia page about Al Gore in RDF/XML is available at http://dbpedia.org/data/Al\_Gore.

A problem with Wikipedia's infoboxes is the use of synonymous terminology for attribute names, for example *birth-date* and *date-of-birth*. The *DBpedia ontology* tackles this problem by mapping Wikipedia templates to the ontology. The ontology was created manually and includes around 170 classes in a subsumption hierarchy<sup>20</sup>, as well as 720 ontology properties. The 350 most commonly used infobox templates, including 2350 attributes, are mapped to classes and properties of the ontology. The DBpedia ontology can be characterized as lightweight; the DBpedia team plans further extensions in terms of additional axioms and other constraints for future releases. As of October 2009 the ontology contains about 882,000 instances, of which 248,000 are of type place, 214,000 of type person, 76,000 of type organization, etc.

DBpedia applies four classification schemata to categorize its resources: a SKOS<sup>21</sup> representation of the Wikipedia category system, the YAGO hierarchy [180], the UMBEL ontology<sup>22</sup>, and the DBpedia ontology. The present work currently makes use of the DBpedia ontology and to a minor degree of UMBEL. Extended investigations to integrate further classification schemata are postponed to future work.

#### **Provision of Identifiers**

DBpedia defines entity identifiers which are Web-dereferenceable, and thereby establishes a basis for interlinking data sources on the Web. Those identifiers are globally unique and should be used according to Linked Data principles.<sup>23</sup> DBpedia uses English article names from Wikipedia to generate its identifiers. The URIs are created from a concatenation of the prefix http://dbpedia.org/resource/ with the respective article name suffix from Wikipedia, for example **Al\_Gore.** Using DBpedia URIs as identifiers has a number of advantages: They cover a wide range of encyclopedic topics defined by community consensus and provide stable URIs to knowledge management applications [84]. Furthermore, an extensive textual representation exists at a well-known Web location. When accessed with a Web browser, DBpedia resources deliver a simple human-readable representation of the underlying data. When accessed by other agents, for example Semantic Web crawlers, RDF/XML data is returned.
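The construction of such identifiers can be sketched as a simple string operation. The Python function below is an illustration only: replacing spaces with underscores follows Wikipedia's article naming, while the percent-encoding of the remaining characters via `urllib.parse.quote` is an assumption and may differ from the exact rules of a given DBpedia release. The function name `dbpedia_uri` is hypothetical.

```python
from urllib.parse import quote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def dbpedia_uri(article_title):
    """Concatenate the DBpedia prefix with a Wikipedia-style article name."""
    # Wikipedia article names use underscores instead of spaces.
    name = article_title.strip().replace(" ", "_")
    # Assumption: characters such as the comma are percent-encoded.
    return DBPEDIA_PREFIX + quote(name)


print(dbpedia_uri("Al Gore"))
print(dbpedia_uri("Washington, D.C."))
```

Mapping a concept label to a DBpedia entry therefore starts with a candidate URI of this form, which can then be dereferenced to check whether a resource exists.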

<sup>20</sup>A graphical representation of this hierarchy can be found at http://www4.wiwiss.fu-berlin.de/dbpedia/dev/ontology.htm

<sup>21</sup>http://www.w3.org/2004/02/skos

<sup>22</sup>http://www.umbel.org, the Upper Mapping and Binding Exchange Layer, which is a lightweight ontology for interlinking Web content and data to a standard set of subject concepts.

#### **Links Between Data Sources**

DBpedia is one of the central hubs in the emerging Linking Open Data (LOD) community project<sup>24</sup>. The goal of the LOD project is to extend the Web of Data by publishing various open data sets as RDF and linking the data items from different data sources. DBpedia contains links to a number of external data sets, and many projects link to DBpedia or use DBpedia URIs as entity identifiers. The linkage provides the foundation for many applications, namely for browsing and crawling of the Web of Data, for the fusion of data and the creation of mashups, and also for the annotation of Web content based on DBpedia URIs. The rest of the section focuses on outgoing links, because they are utilized in the present thesis. The DBpedia knowledge base currently contains about 4.9 million outgoing links. The DBpedia resource on *Austria* exemplifies such outgoing links:

```
<http://dbpedia.org/resource/Austria> owl:sameAs
    http://umbel.org/umbel/ne/wikipedia/Austria;
    http://sw.opencyc.org/concept/Mx4rvViU15wpEbGdrcN5Y29ycA;
    opencyc:Mx4rvViU15wpEbGdrcN5Y29ycA;
    http://www4.wiwiss.fu-berlin.de/factbook/resource/Austria;
    http://data.nytimes.com/66221058161318373601;
```
The first line in the listing gives the subject (the resource) and the predicate, in this case the **owl:sameAs** property, which declares two individuals as identical. The four objects in this abbreviated syntax refer to complementary data for the resource. The objects comprise resources from Freebase, the CIA World Factbook<sup>25</sup>, country information from Eurostat<sup>26</sup>, and finally from OpenCyc. Outgoing links to ontologies like OpenCyc or UMBEL yield additional conceptual information about DBpedia resources, which allows for reasoning as applied in the present work. DBpedia includes, among other outgoing links, 2,400,000 links to Freebase, 60,000 to OpenCyc and 20,000 links to UMBEL.

### **3.4.2 Freebase**

Freebase<sup>27</sup> is an open database built by the community for the community. It is owned and hosted by the commercial company Metaweb Technologies.<sup>28</sup> Similar to Wikipedia it contains cross-domain information, but Freebase is not an encyclopedia; it focuses on structured information. The content in Freebase is free for anyone to query over an open API or to integrate in Web sites or applications. Freebase draws from Wikipedia and other online archives to create the initial content for the database. Users can then edit the data in a wiki-like fashion: they contribute information about their areas of interest, and also modify the category system or add named relations to other resources. Currently Freebase covers millions of topics organized in hundreds of categories, for example entries on movies, people, science or sports. Freebase plays a minor role in our concept type detection component. The present work only integrates data from Freebase if an **owl:sameAs** link from DBpedia to Freebase exists. The method mainly exploits the Freebase category system, which includes categories such as **base.people** or **base.organization.**

<sup>24</sup>http://linkeddata.org

<sup>25</sup>https://www.cia.gov/library/publications/the-world-factbook

<sup>26</sup>http://epp.eurostat.ec.europa.eu

<sup>27</sup>http://www.freebase.com

### **3.4.3 OpenCyc**

Cyc<sup>29</sup> is one of the largest and most comprehensive general knowledge bases and commonsense reasoning engines in existence. Cyc is currently being developed by Cycorp, a commercial company with around 40 employees. OpenCyc<sup>30</sup> is an open source version of Cyc; it includes the entire Cyc ontology, which comprises hundreds of thousands of terms and millions of assertions.

Cyc is an artificial intelligence project started in 1984 with the aim of building an ontology and knowledge base of common sense knowledge in order to support intelligent applications. The project is based on CycL, a proprietary format for representing knowledge. The CycL ontology language is grounded in first-order predicate calculus, and also includes extensions for modal operators, context and meta-level assertions [103]. In 1994 the Cyc project was spun off into Cycorp.

The two main aspects in Cyc are knowledge engineering, i.e. manually defining rules about facts in the world, and the application of reasoning techniques on those rules to generate additional rules. Assertions in Cyc are tagged with the contexts in which they are true. In this way Cyc covers common cases for each problem, instead of trying to find a single general solution [103].

We made use of OpenCyc, especially the OpenCyc ontology, which is available for download from their Web site.<sup>31</sup> OpenCyc contains the complete Cyc ontology with all its concepts. The present work focuses on exploiting taxonomic relations in the OpenCyc ontology; a reasoning engine from the Jena framework supports the task.

<sup>29</sup>http://www.cyc.com

<sup>30</sup>http://opencyc.org

<sup>31</sup>http://www.opencyc.org/downloads

The first three chapters laid the foundation for the remainder of the thesis. Chapter 2 introduced the broader context of the present work. A detailed description of ontologies and related tools and standards was given in Chapter 3. The next chapter will present the methodology, including fundamental techniques in ontology learning and a literature review. Most significantly, it will outline novel methods to tackle the problem of learning non-taxonomic relations in domain ontologies and combine approaches to exploit natural language text corpus data with knowledge inferred from online Semantic Web sources.

# **Chapter 4 Methodology**

This chapter focuses on the presentation of a novel approach for the labeling of unnamed relations between concepts in ontologies. Sections 4.1 to 4.4 introduce state-of-the-art methods for learning semantic relations, a review of related literature, and the webLyzard ontology extension architecture and thereby provide the foundation for the novel methods formally discussed in Section 4.5 and described regarding their implementation in Section 4.6.

The presented methods combine an approach to label non-taxonomic relations based on corpus statistics with knowledge about the relations' concepts inferred from Semantic Web data. The inputs to the method are a list of unnamed relations, a set of predefined relation types to choose from together with associated ontological definitions, and a domain corpus. First, the algorithm extracts verbs co-occurring with the input concept pairs from domain text. Vector space similarity between the verb vector of the unnamed relation and the verb vectors of relations from a knowledge base then yields relation label suggestions. Concept type information inferred via external structured data sources, combined with internal ontological restrictions, helps to remove invalid relation label suggestions or to decrease their similarity scores. The method itself is independent of any particular ontology learning system, but, as already mentioned, it has been developed as part of the webLyzard ontology extension architecture.
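The corpus-based step can be made concrete with a minimal Python sketch. It assumes cosine similarity over raw verb counts as the vector space similarity; the function names, relation labels and verb lists are hypothetical and chosen only for illustration. The optional `allowed` argument stands in, in a very simplified way, for the filtering by ontological restrictions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts (term -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def suggest_labels(unlabeled_verbs, labeled_verbs, allowed=None):
    """Rank candidate labels by similarity of their verb vectors.

    unlabeled_verbs: verbs co-occurring with the unnamed relation
    labeled_verbs:   dict label -> verbs co-occurring with known relations
    allowed:         optional set of labels that pass the (assumed) type check
    """
    query = Counter(unlabeled_verbs)
    scores = {}
    for label, verbs in labeled_verbs.items():
        if allowed is not None and label not in allowed:
            continue  # label ruled out, e.g. by a domain/range restriction
        scores[label] = cosine(query, Counter(verbs))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical knowledge base of labeled relations and their verbs.
knowledge_base = {
    "produces": ["produce", "emit", "generate", "release"],
    "locatedIn": ["lie", "locate", "situate"],
}
ranking = suggest_labels(["emit", "release", "burn"], knowledge_base)
filtered = suggest_labels(["emit", "release", "burn"], knowledge_base,
                          allowed={"locatedIn"})
```

In this toy setting the label `produces` ranks first because it shares two verbs with the unnamed relation, while `locatedIn` shares none.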

The chapter is organized as follows: Section 4.1 introduces and motivates the research on ontology learning and elaborates on typical ontology learning tasks. Section 4.2 gives an overview of fundamental techniques commonly applied in ontology learning and especially in the learning of semantic relations, including methods from computational linguistics, machine learning and statistics. It focuses on techniques used in the approach put forward in this thesis, and helps to comprehend the literature review, which follows in Section 4.3. The webLyzard ontology extension architecture (Section 4.4) applies some of the methods for ontology learning discussed throughout the chapter and also introduces others; it represents the foundation for the novel method introduced in this doctoral thesis. Section 4.5 gives a detailed description of the proposed algorithms for detecting non-taxonomic relations, their interactions, and the integration of the various components of the system. Finally, Section 4.6 provides an overview of the design and technical features of the Python-based and database-driven software components that implement the presented approach.

## **4.1 Ontology Learning**

As already emphasized in the previous chapter, ontologies play a key part in the Semantic Web as they provide its backbone. However, constructing ontologies manually is a cumbersome and expensive process [128], which relies on highly specialized human effort (e.g. from domain specialists and knowledge engineering experts) [50]. For the success of the Semantic Web and knowledge-based systems, fast and cheap ontology development is crucial - one approach to tackling this problem is to learn ontologies semi-automatically or automatically. The respective field of research is called ontology learning, which is concerned with knowledge discovery from different data sources and with its representation in an ontological structure [50]. Cimiano [37] describes ontology learning as the acquisition of a domain model from data.

Section 3.1.3 already mentioned the three possible kinds of input data used in ontology learning [13]: structured data, semi-structured data and unstructured data. Ontology learning systems extract the concepts for a domain and the relations holding between them, and eventually axioms. It is crucial that the input data is representative of the domain to be modeled [37]. Ontology learning can be seen as a reverse engineering task, which reconstructs the world model expressed implicitly by the authors of domain texts. A major problem pointed out by Brewster et al. [19] is that most domain-specific text assumes basic domain knowledge, and only the part of the domain which is the subject of the text is mentioned more or less explicitly. *Salience*, as addressed by Sowa [174], is another issue in ontology learning from text - the problem that people often prefer more salient terms over more precise, but less salient, terms. As an example, dogs are usually referred to as "animals", a term which has a high salience, and not as "mammals", which would be more precise. This systematically impairs the extraction of relevant terms with statistical methods. Salience is also a problem in ontology alignment, as more salient terms are sometimes wrongly preferred as concept descriptors.


Although the ontology engineering process used to be more an art or a craft [32] than a science, much effort has been put in the creation of methodologies to turn it into the latter. Cimiano et al. [39] summarize the typical ontology engineering phases: feasibility study, requirements analysis, conceptualization, and deployment. These phases form a loop of application, evaluation and maintenance of the ontology. Ontology learning can support several critical parts of these phases, for example to build an initial conceptualization of the domain, which then serves as base for discussion, or to extend and refine an existing ontology model in the maintenance phase.

It is necessary to identify the steps involved in ontology learning in order to establish the ontology learning tasks - Buitelaar et al. [23] organize the aspects of ontology learning into a set of layers, as presented in Figure 4.1. The identification of domain concepts is possible only after the extraction of their natural language representations (symbols) - this is especially important for ontology learning from text [50]. Those lexical entries *Le* (as presented in Section 3.1.2) provide links between single words or phrases in text and the ontology's concepts. Synonym extraction helps to detect and merge redundant or very similar terms that refer to the same concept. After building a concept taxonomy *He,* which serves as backbone for the ontology, the next step focuses on the learning of non-taxonomic relations *R,* which represents the major contribution of this doctoral thesis. Finally, rules (axioms) may be defined and acquired [23] in order to derive facts that are not explicitly encoded by the ontology.

*Figure 4.1:* Ontology Learning Layers (adopted from [23])

The following summary describes the steps involved in ontology learning, as shown in Figure 4.1, in more detail:


the concept hierarchy for those restrictions; (iv) identify a hierarchical order between the given relations


## **4.2 Fundamental Methods for Learning Semantic Associations**

This section introduces fundamental methods and techniques from various fields such as computational linguistics, machine learning and statistics. The presented methods are in no way meant to be exhaustive for those areas, as that would go far beyond the scope of this thesis. The goal is rather to provide a foundation helping to understand the material described in the upcoming Section 4.3 *Literature Review* and the section about the methods applied in the present work, Section 4.5. While this section summarizes traditional and state-of-the-art techniques and methodologies, Section 4.3 deals with the application of those methods to ontology learning, especially for the task of extraction and learning of non-taxonomic relations.

#### **4.2.1 Natural Language Processing Techniques**

#### **Preprocessing**

Natural language is the primary medium by which humans communicate with each other. It allows them to ask questions, express beliefs, desires and attitudes, as well as to report events, actions and states [37]. The various syntactic categories (nouns, verbs, adverbs, etc.) are used in natural language to refer to different ontological entities. The following enumeration lists the most important syntactic categories, including their typical application and some examples:


It has to be noted that this classification is a very rough one - natural language is rich in the ways things can be expressed, and there are many exceptions to most prototypical rules; for example, nouns are often used to express events (e.g. "the climate summit at Copenhagen"). Jurafsky and Martin [94] give more detailed information on English parts of speech.

Verbs often relate nouns to each other, a characteristic that is exploited in the present thesis. Verbs also indicate which members of classes can perform an action or participate in an event. This is exemplified in "The man drinks a glass of water", which indicates that *to drink* can be performed by members of the class *man*. Such limitations correspond to *selectional restrictions* [143] in computational linguistics, which can be seen as the conditions specifying to which types of classes verbs or adjectives are applicable.

It is necessary to preprocess natural language text in order to exploit its characteristics e.g. for ontology learning when applying a more advanced analysis. Preprocessing typically includes the following steps [37] - the Natural Language Processing (NLP) application does not have to apply them in this exact order:


*Segmentation* and *tokenization* aim at detecting sentence and word boundaries. Sentence splitting or *sentence tokenization* can be done with complex regular expressions, or with binary classifiers based on machine learning techniques. A typical problem in sentence splitting is to distinguish punctuation signs such as periods in their use as abbreviations and as end-of-sentence markers. The NLP application then splits sentences into single words ( *word tokenization),* mostly relying on spaces in sentences and some additional rules. Other languages such as Chinese or Thai do not use spaces as potential word boundaries, therefore other algorithms need to be applied [94]. A *normalization* step can be integrated into *segmentation,* for example to transform occurrences of dates into a standard format. Most applications that provide tokenization include a stopword removal component to filter non-discriminating words such as "the", "many", etc. from the list of terms based on a stopword dictionary.
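The segmentation steps described above can be sketched in a few lines of Python. The abbreviation list, the stopword list and the rule-based splitting are simplifying assumptions for illustration; a production system would rather rely on a trained sentence splitter and a full stopword dictionary.

```python
import re

# Toy resources, assumed for this example only.
STOPWORDS = {"the", "a", "an", "of", "many", "and", "is", "in"}
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "etc."}

def split_sentences(text):
    """Rule-based sentence splitting: a chunk ending in . ! or ? closes a
    sentence unless the chunk is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Word tokenization based on word characters, keeping inner hyphens
    and apostrophes; tokens are lowercased as a simple normalization."""
    return re.findall(r"\w+(?:[-']\w+)*", sentence.lower())

def remove_stopwords(tokens):
    """Filter non-discriminating words using the stopword dictionary."""
    return [t for t in tokens if t not in STOPWORDS]

text = "Dr. Smith studies the climate. Many glaciers melt in summer."
sentences = split_sentences(text)
tokens = [remove_stopwords(tokenize(s)) for s in sentences]
```

The abbreviation check is what prevents the period in "Dr." from being treated as an end-of-sentence marker.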

Part-of-speech (POS) tagging assigns the respective part-of-speech to each token. The part-of-speech denotes the syntactic word category, such as noun, verb, adjective - usually in a more fine-grained differentiation. Depending on the tagset used, computational linguistics typically distinguishes more than 40 separate parts-of-speech for the English language. A simple word dictionary is not sufficient to do POS tagging, because many words represent different POS depending on their usage and context. Frequently applied tagsets include the Penn Treebank tagset<sup>1</sup> [113], which contains 45 tags, and the 87-tag tagset used for the Brown corpus [61, 62]. The Brown corpus is a million-word collection of samples from various genres, assembled at Brown University in 1963-1964. The corpus was automatically tagged and then manually corrected.

Two commonly used POS taggers are Brill tagger [20] and TreeTagger [161]. Brill tagger is a transformation-based tagger, which assigns a tag to each word and then changes it using a set of predefined rules. This tagger applies lexical rules to initially assign tags, and contextual rules to refine the tags afterwards. TreeTagger is based on decision trees, which estimate transition probabilities with the help of a transition tree.

*Stemming* completely removes the endings (inflections) of words, leaving the stem or root. For example, *generat* is the stem of *generate*, *generating* etc. A well-known stemmer is the *Porter stemmer* [134], which uses a simple and efficient algorithm based on a series of cascaded rewrite rules. This thesis applies a *lemmatization* approach, where different inflected forms are grouped together under their *lemma* with the help of a lexicon. So for example *better* and *good* are replaced with *good*; *generating* and *generated* are reduced to *generate*.

<sup>1</sup>http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html

*Named Entity Recognition* (NER) is a subtask of information extraction that aims at recognizing unique objects, such as *Dr. Thaddeus Venture* or *Matterhorn*. Traditionally, named entity recognition is restricted to detecting certain entity classes, which typically include *persons*, *organizations*, *locations* and *dates*. State-of-the-art systems have a near-human performance; see for example Radu et al. [60], who combine four diverse classifiers (robust linear classifier, maximum entropy, transformation-based learning, and hidden Markov model).

Another preprocessing step often executed in NLP is *chunking*, also referred to as *shallow* or *partial parsing* [37]. Chunking relies on techniques such as regular expressions and finite state automata to group words into larger syntactic and meaning-bearing units. The main element of such a unit is the *head*: in noun phrases in English language text the rightmost noun is generally the head, whereas in verb phrases the verb is the main meaning-bearing unit. Those syntactic units (chunks) are non-overlapping, non-recursive, and non-exhaustive. Non-exhaustive means that some words in a sentence may not belong to a chunk. Chunkers (in contrast to syntactic parsers, see below) detect neither grammatical relations (such as subject, object etc.) nor syntactic or semantic ambiguities. Chunkers are used when no complete parse trees are needed for all inputs [94].

These basic preprocessing steps are frequently applied as prerequisite for other methods. For example Ruiz-Casado et al. [148] apply segmentation (tokenizer and sentence splitter), POS tagging, stemming, NER, and a chunker (partial syntactic analyzer) to support the extraction of relations in the process of semantic annotation of Wikipedia.

#### **Syntactic Analysis**

More complex and challenging than simple chunking is *syntactic analysis (parsing),* which aims at discovering the full syntactic structure of a given input sentence [37]. Parsing detects larger units of words and makes dependency relations explicit. It determines the grammatical structure of a sentence with respect to a given formal grammar. The result of this step is a parse tree, where the whole sentence (root of the tree) is split into smaller syntactic units recursively. There are two main strategies in syntactic parsing: bottom-up (data-directed search) and top-down (goal-directed search). An example of a syntactic parser is LoPar [162], a parser for probabilistic context-free grammars as well as head-lexicalized probabilistic context-free grammars.

#### **Contextual Features**

Extracting and representing the context of a certain word, or of word pairs in the case of rote extractors (see Section 4.2.2), is important in many NLP applications. Context is crucial in Word Sense Disambiguation (WSD), for example, which is concerned with detecting the correct meaning of a word in a given context. The problem is often exemplified by the term *bank*, which refers to a financial institution or a seating accommodation, depending on the context. One common way to represent context is the word window model, which considers the *n* words to the left and right of the target word as contextual features. Cimiano [37] proposes two other approaches for extracting contextual features, both of which rely on linguistic processing techniques to identify constructs such as subjects and objects of verbs, adjectives or prepositional phrases. *Syntactic dependency* processing parses a text and extracts the constructs mentioned above from the parse tree. The sentence "The cat eats an apple strudel" would result in eat\_subject(cat), eat\_object(apple strudel). *Pseudo-syntactic dependencies* apply shallow parsing combined with regular expressions to avoid the need for real syntactic parsing.
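A word window model is straightforward to sketch. The following Python fragment is a minimal illustration that assumes an already tokenized text and counts the words within *n* positions of each occurrence of the target word; the function name and the example sentence are hypothetical.

```python
from collections import Counter

def window_context(tokens, target, n=2):
    """Collect the n words to the left and right of every occurrence
    of the target word as a bag of contextual features."""
    context = Counter()
    for pos, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, pos - n):pos]      # up to n words before
        right = tokens[pos + 1:pos + 1 + n]     # up to n words after
        context.update(left + right)
    return context

tokens = "the cat eats an apple strudel while the dog eats a bone".split()
features = window_context(tokens, "eats", n=2)
```

With a window of two words, *strudel* is not part of the context of *eats* in this example, which shows how the window size limits the features that are captured.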

#### **4.2.2 Lexico-syntactic Patterns**

Generalizing textual patterns to identify relations has been proposed since the early 1990's, when Marti Hearst presented her seminal work on "Automatic Acquisition of Hyponyms from Large Text Corpora" [82]. The work was inspired by the pattern-based interpretation techniques used in the processing of Machine Readable Dictionaries, which were developed in the 1980's. The approach aims at extracting semantic relations from text with little understanding of the content itself by applying simple lexico-syntactic patterns. An example of such a pattern and the implied relation is:

$$NP\_1 \; \{, NP\_n\}^{\ast} \; \{,\} \; \text{or other} \; NP\_0 \qquad \text{for all } NP\_i,\ 1 \le i \le n:\ \mathit{hyponym}(\mathit{lemma}(NP\_i), \mathit{lemma}(NP\_0)) \tag{4.1}$$

NP stands for a noun phrase, curly braces denote optional elements in the pattern, and the *\** indicates that 0 to *n* occurrences of an element are allowed. Matched on the sentence fragment "Bruises, wounds, broken bones or other injuries ... ", the pattern extracts the following hyponym relations: hyponym(bruise, injury), hyponym(wound, injury), hyponym(broken bone, injury). The function *lemma* returns the base form of a word, in the example above: *injury* for input *injuries,* or *wound* for *wounds.* Hearst also sketches a procedure for learning new patterns for a given relation, or patterns for a new relation. This procedure basically relies on acquiring occurrences of the corresponding terms, and on generalizing the respective phrases found in text.
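The "or other" pattern can be approximated with a regular expression. The sketch below is a deliberately crude illustration: it assumes that a noun phrase is one or two lowercase words, that the hypernym is a single word, and it uses a toy suffix rule in place of a real lexicon-based *lemma* function. A realistic implementation would match the pattern on the output of a chunker instead.

```python
import re

# Crude stand-in for a noun phrase chunk: one or two lowercase words.
NP = r"[a-z]+(?: [a-z]+)?"
# NP {, NP}* {,} or other NP  (hypernym restricted to one word here)
PATTERN = re.compile(rf"((?:{NP}, )*{NP}),? or other ([a-z]+)")

def lemma(noun_phrase):
    """Toy lemmatizer (assumption): singularize the head noun by suffix."""
    *modifiers, head = noun_phrase.split()
    if head.endswith("ies"):
        head = head[:-3] + "y"
    elif head.endswith("s"):
        head = head[:-1]
    return " ".join(modifiers + [head])

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs for every match of the pattern."""
    relations = []
    for match in PATTERN.finditer(text.lower()):
        hypernym = lemma(match.group(2))
        for noun_phrase in match.group(1).split(", "):
            relations.append((lemma(noun_phrase), hypernym))
    return relations

pairs = extract_hyponyms(
    "Bruises, wounds, broken bones or other injuries are common.")
```

On this sentence the sketch yields the three hyponym relations named in the text. On free text the simplified noun phrase expression would produce many false matches, which is one reason why precision is a concern for such patterns.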

Hearst [82] describes a number of patterns for extracting *is-a* relations, among which is the example given above. The difficulty lies in finding constructions that frequently and reliably indicate a relation of interest. The following characteristics are desired [82]:


So the first characteristic is concerned with the recall of the pattern, the second one with precision. However, neither recall nor precision of the original Hearst patterns is satisfactory: the patterns occur quite rarely in ordinary text, so large corpora are necessary. Some of the subsequent work related to lexico-syntactic patterns described below addresses the issues of raising precision and recall.

Many research papers inspired by the original paper of Hearst in 1992 have been published. Among those are extensions of the set of patterns [89], the application of Hearst patterns in specific contexts, the definition of patterns to extract and populate other types of relations, namely non-taxonomic relations [133, 3, 14, 197, 67], and the combination of Hearst patterns with methods such as Latent Semantic Indexing [30]. More recently researchers also matched the patterns on the Web using search engine APIs such as the one of Google [156, 40, 41, 53] - addressing low recall as a well-known problem of Hearst patterns. For more information on the details of these approaches see Section 4.3, *Literature Review.*

An important step in the evolution of lexico-syntactic patterns is the automatic acquisition of patterns for a list of predefined relations. This learning task is typically based on a set of hand-crafted examples per relation, and a corpus from which patterns are extracted subsequently. Rote extractors [21, 1, 140] are a well-known approach to pattern learning.

Rote extractors allow extracting non-taxonomic relations from text. They look for textual contexts that happen to convey a certain relation between two concepts [7]. More precisely, rote extractors estimate the probability of a relation *r(p, q)* given the surrounding context *A<sub>1</sub>pA<sub>2</sub>qA<sub>3</sub>* [110]. The method of Ravichandran and Hovy [140] is often applied to train a rote extractor from the Web: The first step is to select a pair of related elements (e.g. *Dickens*, *1812* for a relation *birth-year*). A query to a Web search engine in the form of *term1* AND *term2* (e.g. "Dickens AND 1812") generates a corpus. The algorithm then extracts sentences with both elements, and identifies frequently occurring patterns in those sentences. Then a *hook corpus* is downloaded, which is retrieved with just *term1*, in the example *Dickens*. The hook corpus helps to calculate the precision of each pattern as the number of times it identifies a *target* (*term2*) related to the hook, divided by the total number of times the pattern appears. The method repeats those steps for other examples of the same relation. Rote extractors have the advantage that the collection of training corpora is easy and done automatically - and therefore they can discover many different relations from text.
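The precision estimate on the hook corpus can be illustrated with a short Python sketch. The placeholder syntax `<HOOK>` and `<TARGET>`, the function names and the four example sentences are assumptions made for this illustration; they are not taken from [140].

```python
import re

def compile_pattern(pattern, hook):
    """Turn a surface pattern with <HOOK>/<TARGET> placeholders into a
    regular expression that captures whatever fills the target slot."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<HOOK>"), re.escape(hook))
    regex = regex.replace(re.escape("<TARGET>"), r"(\w+)")
    return re.compile(regex)

def pattern_precision(pattern, hook, target, hook_corpus):
    """Precision = matches that yield the known target / all matches."""
    regex = compile_pattern(pattern, hook)
    matches = [m.group(1) for s in hook_corpus for m in regex.finditer(s)]
    if not matches:
        return 0.0
    correct = sum(1 for found in matches if found == target)
    return correct / len(matches)

# Hypothetical hook corpus retrieved with the hook term only.
hook_corpus = [
    "Dickens was born in 1812 in Portsmouth.",
    "Dickens was born in Portsmouth.",
    "Charles Dickens was born in 1812.",
    "Dickens died in 1870.",
]
p_born = pattern_precision("<HOOK> was born in <TARGET>",
                           "Dickens", "1812", hook_corpus)
p_died = pattern_precision("<HOOK> died in <TARGET>",
                           "Dickens", "1812", hook_corpus)
```

The pattern "was born in" fires three times on the hook corpus but only twice with the correct birth year, which gives a precision of 2/3; the pattern "died in" never yields the target and receives a precision of 0.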

Alfonseca et al. [7] argue and demonstrate that the traditional method by Ravichandran and Hovy [140] to calculate the precision of patterns is unreliable in some cases, and they suggest various improvements to better estimate the precision of rote extractors in non-taxonomic relation extraction. Among those are to also collect a *target corpus* in addition to the hook corpus in order to refine a pattern's precision, and to test patterns found for a distinct relation also on hook and target corpora of other relation types, e.g. test the patterns for the relation *writer-book* also on corpora for the relation *painter-painting*. The algorithm proposed by Alfonseca et al. [7] introduces additional information per training relation, such as the cardinality of the relation (e.g. a person has only one birth year, but a birth year is shared by *n* persons), or restrictions on the hook and target to certain POS-tags or types annotated with a *named entity recognition* module. In cases of uncertainty Web search engine query patterns confirm or reject relation instances. These techniques help to define more fine-grained and precise patterns, and to detect if the method learns the same patterns for different relations.

Chagnoux et al. [33] present a semi-automatic, pattern-based approach for extracting non-taxonomic relations from text. They aim at using and discovering relation extraction patterns, and at enriching existing ontologies with new relations. The input to the process is an existing ontology, a set of patterns per relation (pattern base) for every relation type known to the system at that moment, and a tagged (domain) corpus - whereas the result is an enriched ontology and an enriched pattern base. The procedure is as follows: For all concept pairs *c<sub>i</sub>*, *c<sub>j</sub>* in the ontology the system scans the text corpus to see if any of the given relation patterns match. If a match is found, the respective relation is assumed for the concept pair. If no matches are found, they query the semantic search engine Watson [47] to retrieve a relation label for the concept pair. If the system finds a new relation type (label) via Watson, it automatically extracts patterns representing this relation from the text corpus. Finally, all new relations extracted from Watson are manually validated for relevance, and all new patterns are presented to the user for validation. Relations and patterns that pass the evaluation step are added to the pattern base. Chagnoux et al. [33] argue that the manual evaluation of all new patterns and relations guarantees their semantic significance, as well as relevance.

The techniques presented rely on linguistic patterns. Linguistic patterns are highly successful in specific applications, but traditionally lack the generic ability of adding new domain-specific relation types. Approaches like the method by Chagnoux et al. [33] and the paradigm of Open Relation Extraction (see below) address this problem.

In 2007 Banko et al. [9] presented a new paradigm in information extraction (IE), called Open IE. Open IE is complementary to traditional IE, which usually serves precise and narrow requests on small homogeneous corpora. Traditional IE has the drawback that, when shifting to a new domain and to new relations, the user has to manually create extraction rules and tag new training examples by hand. The manual effort needed scales linearly with the number of relations. In Open IE, the system makes a single data-driven pass over a corpus and extracts a large set of relational tuples without requiring any human input, so the runtime is constant with regard to the number of relations. Open IE is a *relation-independent* extraction paradigm that is tailored to massive and heterogeneous Web corpora. Banko et al. [10] argue that Open IE is necessary when the number of relations is massive and the relations are not pre-specified. Current Open IE methods rely on general lexico-syntactic patterns used to express relations, combined with models for contextual features and eventually additional features such as part-of-speech, named entities, etc. An Open IE extraction system operates in two phases: First it learns a general model of how relations are expressed in the language under consideration, and then it can "utilize this model as the basis of a relation-independent extractor whose sole input is a corpus and whose output is a set of extracted tuples that are instances of a potentially unbounded set of relations" [52, p 71]. Section 4.3, *Literature Review,* presents three interesting applications of the Open IE paradigm: TEXTRUNNER, an extractor for raw natural language text [9], WEBTABLES for extracting relations from structured data such as HTML tables, and a system for surfacing data from the "Deep Web".

#### **4.2.3 Relevant Statistical and Information Retrieval Measures and Methods**

This section introduces basic measures from the field of statistics and Information Retrieval (IR) used in the remainder of this thesis. It also includes a number of more advanced methods relevant to the task of learning relations in ontologies.

#### **Term Relevance**

For many applications it is necessary to determine the relevance of terms occurring in documents. A well-known measure for term relevance is *term frequency* - *inverse document frequency (tf-idf)* (see e.g. [153]). The measure is calculated as follows:

$$tf\_{i,j} = n, \quad \text{where term } i \text{ occurs } n \text{ times in document } j \tag{4.2}$$

$$idf\_i = \log\_2 \frac{|D|}{df\_i} \tag{4.3}$$

$$tf \cdot idf\_{i,j} = tf\_{i,j} \cdot idf\_i \tag{4.4}$$

The first step computes the frequency in document *j* for any term *i*. The number of documents in the collection ($|D|$) and the number of distinct documents in the collection that contain the term *i* ($df\_i$) determine the *idf<sub>i</sub>*. The inverse document frequency penalizes terms occurring in many different documents, because they have low discriminative capabilities. Finally, the *tf-idf* results from the multiplication of *tf<sub>i,j</sub>* and *idf<sub>i</sub>*.
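Equations 4.2 to 4.4 translate directly into code. The following Python sketch assumes documents that are already tokenized and filtered; the function name and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute tf-idf weights following Equations 4.2 to 4.4.

    documents: list of token lists; returns one dict (term -> weight)
    per document.
    """
    term_counts = [Counter(doc) for doc in documents]   # tf_{i,j}
    df = Counter()                                       # df_i
    for counts in term_counts:
        df.update(counts.keys())                         # one count per document
    n_docs = len(documents)                              # |D|
    return [
        {term: tf * math.log2(n_docs / df[term]) for term, tf in counts.items()}
        for counts in term_counts
    ]

docs = [
    ["climate", "change", "climate", "policy"],
    ["climate", "summit"],
    ["energy", "policy"],
    ["climate", "energy"],
]
weights = tf_idf(docs)
```

In the first document, *change* occurs only once but receives the highest weight, because it appears in no other document, whereas *climate* occurs twice but is penalized for appearing in three of the four documents.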

#### **Recall and Precision**

Two of the most fundamental measures in IR and related fields are *recall* and *precision,* which are defined as [153]:

$$Recall = \frac{\text{Number of relevant items retrieved}}{\text{Total number of relevant items in collection}}$$

$$Precision = \frac{\text{Number of relevant items retrieved}}{\text{Total number of items retrieved}}$$

Both measures vary from 0 to 1, and usually a high precision and a high recall are preferable. In information retrieval and NLP applications there typically is a trade-off between recall and precision. A middle point is best in most cases, but when in doubt a higher precision with a lower recall is better than vice versa, especially if the collection is huge and recall is not the primary bottleneck. In the case of a NER system, a high recall means that a named entity is rarely missed, and high precision refers to a high ratio of correctly tagged named entities.
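For the NER case, both measures can be computed from the set of entities a system tagged and a gold standard set. The following Python sketch uses hypothetical entity sets for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items
    against the set of relevant (gold standard) items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical output of a NER system and the corresponding gold standard.
tagged = {"Vienna", "Matterhorn", "Monday", "Cycorp"}
gold = {"Vienna", "Matterhorn", "Cycorp", "Metaweb", "Brown University"}
precision, recall = precision_recall(tagged, gold)
```

Here three of the four tagged entities are correct (precision 0.75), while three of the five gold standard entities were found (recall 0.6).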

#### **Pearson's Chi-square Test**

*Pearson's χ²* is a test for statistical significance. *χ²* tests are commonly used to compare observed data with data that would be expected according to a specific hypothesis. Pearson's chi-square is a prominent example of a chi-square test. It is used for two types of comparisons, the test of goodness of fit and the test of independence. Goodness of fit tests if an observed frequency distribution differs from a theoretical distribution. The test of independence assesses if paired observations on two variables are independent of each other, for example if persons of different age differ in their preference for a political candidate. The test has the general form of

$$\chi^2 = \sum \frac{(O-E)^2}{E} \tag{4.5}$$

where O refers to an observed frequency, and *E* is the expected frequency, according to the null hypothesis.

Cimiano [37], inspired by Manning and Schütze [111], demonstrates the *χ²* test for the problem of deciding if two terms are related based on co-occurrence in text. The test result determines the relatedness and strength of the relation, e.g. between the terms *pirate* and *treasure.*

The function *f* in Table 4.1 refers to a statement like "the function's arguments appear in the same document", or "the function's arguments appear within a sliding window of 7 words in the same sentence".


*Table 4.1:* Example for a 2-by-2 χ² table

The following equation applies to compute the *χ²* value for such a constellation:

$$\chi^2 = \sum \frac{(O\_{i,j} - E\_{i,j})^2}{E\_{i,j}} \tag{4.6}$$

In a 2-by-2 constellation, the expected values can be expressed in terms of the observed values, so that *χ²* can be computed as follows [111], where *N* is the total number of observations:

$$\chi^2 = \frac{N(O\_{1,1}O\_{2,2} - O\_{1,2}O\_{2,1})^2}{(O\_{1,1} + O\_{1,2})(O\_{1,1} + O\_{2,1})(O\_{1,2} + O\_{2,2})(O\_{2,1} + O\_{2,2})}\tag{4.7}$$

For our example, given by Table 4.1, this results in:

$$\chi^2 = \frac{88(11 \cdot 50 - 15 \cdot 12)^2}{(11 + 15)(11 + 12)(15 + 50)(12 + 50)} = 4.99896\tag{4.8}$$

A lookup of the value 4.99896 in a *χ²* distribution table (one degree of freedom) reveals that a value this large would occur with a probability of only about 2.5% if the null hypothesis were correct. Assuming a significance level *α* of 0.05 (5%), the null hypothesis is rejected and the alternative hypothesis, i.e. that there is a relation between *pirate* and *treasure*, is accepted.
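The 2-by-2 computation can be retraced with a few lines of Python. This is a minimal sketch, not code from the thesis: the function names are illustrative, the four counts are the ones used in Equation 4.8, and the p-value relies on the fact that for one degree of freedom the upper tail of the χ² distribution equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2-by-2 table (shortcut form, Eq. 4.7)."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

def p_value_one_df(chi2_value):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_value / 2.0))

# Observed counts taken from Equation 4.8
chi2 = chi_square_2x2(11, 15, 12, 50)
p = p_value_one_df(chi2)
print(round(chi2, 5), round(p, 3))
```

Comparing `p` with the chosen significance level (0.05 in the text) then decides whether the null hypothesis of independence is rejected.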

A *χ²* test should not be used if more than 10% of events have expected frequencies below 5. In the event of 1 degree of freedom the expected frequencies should be above 10 - in cases of low frequencies it is advisable to apply Yates' correction for continuity [200]. When the total size of the sample is small, it is necessary to use an appropriate exact test, usually either a binomial test or Fisher's exact test. The *χ²* test does not assume normally distributed data (like the *t-test* does), but assumes that the deviation between observed and expected values is normally distributed.

#### **Vector Space Model**

A common way to represent documents and queries in IR is the *Vector Space Model (VSM)*. Documents and queries are embodied by vectors, which makes it possible to calculate similarities between the two [153]. Another typical use for VSMs is the computation of similarities between documents, for example in the process of document clustering. The present thesis makes heavy use of VSMs for representing relations from a domain ontology based on verbs extracted from domain text which occur with the respective relations (see Chapter 4 for details). The VSM is one of various models to support IR systems and procedures, besides the Boolean model and probabilistic models, for example. Probabilistic models compute relevance probabilities for documents in a collection. The VSM is simple to use and highly effective [153].

Each vector in the VSM includes features, which in classical IR are the terms (words) occurring in a collection of documents. The value of these features is called *term weight* [94]. The simple frequency of terms is a candidate for serving as term weight; a commonly used measure is the already described *tf-idf metric*. The extraction of features (terms) from documents typically includes filtering steps, most prominently stopword filtering. A vector for a document *dᵢ*, composed of the document terms *aⱼᵢ*, has the general form:

$$d\_i = (a\_{1i}, a\_{2i}, a\_{3i}, \dots, a\_{ni})\tag{4.9}$$

Queries to the IR system are transformed into vectors as well, and have the form:

$$q\_j = (a\_{1q}, a\_{2q}, a\_{3q}, \dots, a\_{nq})\tag{4.10}$$

The dimension of the vectors is equal to the number of all different terms in the whole collection, which is denoted by *N*. In big collections this can easily result in several hundreds of thousands of terms [94], even with stopword filtering applied. Single documents and queries only contain small fractions of all the terms, therefore most of the values in the vectors will be zero. Actual applications use sparse representations such as hashes to use available resources efficiently.

Let us suppose, in a very simple example, a document collection in the domain of *tennis*, including only two documents. The first document *d1* contains the terms *aj1* (*tennis, competition, winner, tennis, sunday, tournament, winner*), the second document *d2* includes the terms *aj2* (*tennis, forehand, racket, forehand, forehand*). The first step is to compile a list of all *N* terms (from *term1* to *termn*): (*tennis, winner, sunday, tournament, forehand, racket*), which results in a 6-dimensional vector. Using the term frequency as term weight, the corresponding document vectors are:

$$\begin{aligned} d\_1 &= \langle 2, 2, 1, 1, 0, 0 \rangle \\ d\_2 &= \langle 1, 0, 0, 0, 3, 1 \rangle \end{aligned}$$

A user supplies a query to the system, searching for (*tennis, tournament*). This results in the following query vector:

$$q\_1 = \langle 1, 0, 0, 1, 0, 0 \rangle$$
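The construction of the term-frequency vectors can be sketched as follows. The vocabulary is copied from the term list given in the text; the helper name `tf_vector` is illustrative, and terms that are not part of the vocabulary (here *competition*) are simply not counted, which is an assumption that reproduces the vectors shown above.

```python
from collections import Counter

# Term list as given in the text (one vector dimension per term)
VOCABULARY = ["tennis", "winner", "sunday", "tournament", "forehand", "racket"]

def tf_vector(terms, vocabulary=VOCABULARY):
    """Map a list of terms to a term-frequency vector over the vocabulary."""
    counts = Counter(terms)  # sparse representation: only occurring terms are stored
    return [counts[term] for term in vocabulary]

d1 = tf_vector(["tennis", "competition", "winner", "tennis",
                "sunday", "tournament", "winner"])
d2 = tf_vector(["tennis", "forehand", "racket", "forehand", "forehand"])
q1 = tf_vector(["tennis", "tournament"])
print(d1, d2, q1)
```

The intermediate `Counter` is an example of the sparse, hash-based representation mentioned above; the dense list is only expanded here for readability.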

Now the final step is to decide which document (or a number of documents in a real-world system) fits the user query best, and to return that document to the user. The features of documents and queries are used as dimensions in a multi-dimensional space, and the combination of feature values corresponds to a distinct point in that space. The method regards documents located "close" to the query vector in the multi-dimensional space as more relevant than documents farther away. Figure 4.2 shows a reduced two-dimensional view on the vector space for our example - the figure is restricted to the dimensions tennis and tournament, and shows the query vector and the two document vectors. As intuition would suggest, the angle between document *d1* and the query vector is smaller than the angle between *d2* and the query vector.

The upcoming paragraphs review common similarity measures, and apply some of them to the given example for the sake of illustration.

*Figure 4.2:* Two dimensions from a simple VSM example, showing the query vector and two document vectors (dimension 1: tennis)

Before discussing various measures to compute the similarity between vectors, let us briefly reflect on another application of the VSM, which is directly related to the method presented in Chapter 4. In ontology learning, the context of a word is a very important property in order to assess the similarity between words [37]. In this thesis, instead of words, the author applies this principle to word pairs, i.e. the two concept labels (regular expressions) representing the concepts of a relation occurring in the same sentence. The basic idea, however, remains unchanged. The well-known *distributional hypothesis* of Harris [77] states that two words are similar to the extent that they share similar context. Empirical investigations support the correctness of Harris' hypothesis. Grefenstette [72] further demonstrates that relatedness in vector space correlates with semantic relatedness of words [37]. As in most work in ontology learning, the assumption that similarity in context corresponds with semantic similarity is a key aspect in this thesis. A common way to represent context is a vector in high-dimensional space; the interesting question is what features are extracted to serve as context of a word (or relation). Cimiano [37] lists various alternatives for how to define context: One alternative is to define the whole document a word appears in as context [104, 154], which leads to very high-dimensional and computationally intensive vectors.

Other alternatives are word windows of *n* words to the left and right of the target [81, 199, 166, 192], or simply the use of the sentences where the target appears (used in this thesis next to word windows), or specific grammatical constructs such as appositions, copulas, verb-object, verb-subject, adjective modifiers, and nominal modifiers [87, 72, 28].

**Similarity Measures for the VSM.** Cimiano [37] defines the characteristics of a *similarity measure*. It is a function *sim*: ℝⁿ × ℝⁿ → [0, 1], with some special properties: For a feature vector the similarity to another feature vector is 0 if there is no dimension where they both have non-zero values. If there is a dimension where both compared vectors have non-zero values, then the similarity exceeds 0. The maximum similarity is 1, and is given when a vector is compared to itself. Not all similarity metrics need to be symmetric. A *distance measure* is a related type of function, which can be transformed into a similarity measure by a bijective and monotonically decreasing function. One of the characteristics of a distance measure is that the distance between a vector and itself is 0.

A basic ingredient in many similarity measures is the *dot product,* also called *inner product* of two vectors, which is defined in Equation 4.11.

$$a \bullet b = \sum\_{i=1}^{n} a\_i b\_i = a\_1 b\_1 + a\_2 b\_2 + a\_3 b\_3 + \dots + a\_n b\_n \tag{4.11}$$

The dot product is not an appropriate similarity measure by itself, as it is sensitive to the size of the involved vectors - it favors longer vectors and does not remain in the range of [0, 1]. Therefore the dot product needs to be normalized, typically with the vector length. Vector length exists in two variants, the "simple" vector length (Equation 4.12) and the Euclidean vector length (Equation 4.13):

$$\langle a \rangle = \sum\_{i=1}^{n} a\_i \tag{4.12}$$

$$|a| = \sqrt{\sum\_{i=1}^{n} a\_i^2} \tag{4.13}$$

The simplest similarity measures are the ones geared towards binary vectors. The values in binary vectors are in the range of {0, 1}, i.e. a feature is present or not. The *Dice* and *Jaccard* coefficients are two traditional IR measures; they both combine the dot product and variants of the vector length defined above.


$$Dice(a,b) = \frac{2\, a \bullet b}{\langle a \rangle + \langle b \rangle} = \frac{2\sum\_{i=1}^{n} a\_i b\_i}{\sum\_{i=1}^{n} a\_i + \sum\_{i=1}^{n} b\_i} \tag{4.14}$$

$$Jaccard(a, b) = \frac{a \bullet b}{\langle a \rangle + \langle b \rangle - a \bullet b} = \frac{\sum\_{i=1}^{n} a\_i b\_i}{\sum\_{i=1}^{n} a\_i + \sum\_{i=1}^{n} b\_i - \sum\_{i=1}^{n} a\_i b\_i} \tag{4.15}$$

The two given measures for binary vectors have been adapted to vectors containing weighted features. Grefenstette [72] adapted the Jaccard measure as follows:

$$Grefenstette/Jaccard(a, b) = \frac{\sum\_{i=1}^{n} min(a\_i, b\_i)}{\sum\_{i=1}^{n} max(a\_i, b\_i)}\tag{4.16}$$

The numerator in Equation 4.16 reflects the overlapping features in the two input vectors, and the denominator serves as normalizing factor [94]. Curran [46] extended the Dice measure to weighted feature vectors: he uses the Jaccard numerator and replaces the denominator with the total sum of non-zero entries in the vectors.

$$Curran/Dice(a, b) = \frac{2\sum\_{i=1}^{n} \min(a\_i, b\_i)}{\sum\_{i=1}^{n} (a\_i + b\_i)} \tag{4.17}$$

Applied to our example from above, Grefenstette/Jaccard yields the following results - which favor *d1* over *d2*:

$$Grefenstette/Jaccard(q\_1, d\_1) = \frac{1+1}{2+2+1+1} = \frac{1}{3}$$

$$Grefenstette/Jaccard(q\_1, d\_2) = \frac{1}{1+1+3+1} = \frac{1}{6}$$
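Equations 4.16 and 4.17 translate almost literally into code. The following is a minimal sketch with illustrative function names; it assumes non-negative weights and that at least one of the two vectors has a non-zero entry (otherwise the denominators are zero).

```python
def grefenstette_jaccard(a, b):
    """Weighted Jaccard (Eq. 4.16): sum of minima over sum of maxima."""
    overlap = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return overlap / union

def curran_dice(a, b):
    """Weighted Dice (Eq. 4.17): twice the sum of minima over the sum of all weights."""
    overlap = sum(min(x, y) for x, y in zip(a, b))
    total = sum(x + y for x, y in zip(a, b))
    return 2 * overlap / total

# Vectors from the tennis example
q1 = [1, 0, 0, 1, 0, 0]
d1 = [2, 2, 1, 1, 0, 0]
d2 = [1, 0, 0, 0, 3, 1]

for name, doc in (("d1", d1), ("d2", d2)):
    print(name, grefenstette_jaccard(q1, doc), curran_dice(q1, doc))
```

Both measures rank *d1* above *d2* for this query.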

Another, and also the most commonly used, way to assess the distance and similarity between vectors is to approach the task with geometrical measures. The simplest among those is the *Manhattan distance* (also known as *L1 norm*, and referred to as *Levenshtein distance* in [94]), which is defined as:

$$L\_1(a, b) = \sum\_{i=1}^{n} |a\_i - b\_i| \tag{4.18}$$

The *Euclidean distance* or *L2 norm* is defined as follows:

$$L\_2(a,b) = \sqrt{\sum\_{i=1}^n (a\_i - b\_i)^2} \tag{4.19}$$


Those two measures assess the distance between the vector end points, and both stem from the more general *Lq* or *Minkowski* measure [37]. They are very intuitive, but rarely used for vector similarity as they are very sensitive to extreme values, i.e. there is no normalization involved.

The *cosine measure* is the most frequently used vector similarity measure [94]. It is basically a normalized dot product - the dot product is divided by the product of the lengths of the vectors involved:

$$\cos(a,b) = \frac{\sum\_{i=1}^{n} a\_i b\_i}{\sqrt{\sum\_{i=1}^{n} a\_i^2 \sum\_{i=1}^{n} b\_i^2}}\tag{4.20}$$

The normalized dot product is the same as the cosine of the angle between the two vectors, as Equation 4.21 demonstrates:

$$
\cos \theta = \frac{a \bullet b}{|a| |b|} \tag{4.21}
$$

For our example presented above the cosine similarity measure yields the following results, which clearly support the intuition that document d1 is more relevant for query q1:

$$\cos(q\_1, d\_1) = \frac{2+1}{\sqrt{2 \cdot 10}} \approx 0.67$$

$$\cos(q\_1, d\_2) = \frac{1}{\sqrt{2 \cdot 11}} \approx 0.21$$

The cosine is not sensitive to vector length, i.e. longer documents or vectors representing more frequently occurring entities are not favored - the cosine just measures the angle between two vectors, independent of vector length. The resulting value ranges from 1 (if the vectors point in the same direction) to 0 (for orthogonal vectors that share no common features).

The presented ingredients, i.e. the vector space representation for documents (or other entities) and eventually queries, combined with similarity measures, make it possible to create an ad hoc information retrieval system. Such a system accepts a user query, transforms it into a vector, computes the similarity to documents in the collection, and then returns a similarity-ordered list of documents. Ranked retrieval is one of the advantages of the vector space model, next to its simplicity and the ease with which vectors can be modified. One of the downsides is that the vector space model assumes orthogonality, and hence independence between features [153].
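A minimal sketch of such a ranked retrieval step, using the cosine measure of Equation 4.20 and the vectors of the tennis example, could look as follows. The function names and the dictionary layout are illustrative assumptions; all vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity (Eq. 4.20): dot product over the product of Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

def rank(query, documents):
    """Return (score, name) pairs ordered by decreasing similarity to the query."""
    scored = [(cosine(query, vector), name) for name, vector in documents.items()]
    return sorted(scored, reverse=True)

documents = {"d1": [2, 2, 1, 1, 0, 0], "d2": [1, 0, 0, 0, 3, 1]}
q1 = [1, 0, 0, 1, 0, 0]

ranking = rank(q1, documents)
for score, name in ranking:
    print(name, round(score, 2))

# Scaling a document vector changes its length but not its angle to the query
scaled_d1 = [10 * x for x in documents["d1"]]
print(round(cosine(q1, scaled_d1), 2))
```

The last two lines illustrate the length invariance discussed above: multiplying all weights of a document by a constant leaves its cosine score unchanged.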

Another way to measure similarities is based on probability distributions. For more information about this topic and related measures such as *relative entropy*, *mutual information*, or *Jensen-Shannon* or *Skew* divergences, the interested reader is referred to Cimiano [37] and Jurafsky and Martin [94]. Probably the most famous, and rather simple to explain, of these measures is *pointwise mutual information* [54], which relies on information on how often two events *x* and *y* co-occur, in relation to how often they should co-occur when they are independent of each other.
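Pointwise mutual information is commonly written as the logarithm of the ratio between the observed joint probability and the joint probability expected under independence. The sketch below assumes this common formulation with a base-2 logarithm and maximum-likelihood probability estimates; the counts are hypothetical and only serve as illustration.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: 1000 text windows, x occurs in 50, y in 40 of them
associated = pmi(count_xy=20, count_x=50, count_y=40, total=1000)   # co-occur 10x more than chance
independent = pmi(count_xy=2, count_x=50, count_y=40, total=1000)   # co-occur as often as chance
print(round(associated, 3), round(independent, 3))
```

A value close to 0 indicates that the two events co-occur about as often as expected by chance, while larger positive values indicate association.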

#### **Latent Semantic Analysis**

A document collection in information retrieval can be represented as a *term-document matrix*. Figure 4.3 gives the general form of a term-document matrix *A*. It is typically a sparse matrix in which the rows represent the documents in the collection (*D1 ... Dm*), and the columns correspond to all the terms occurring in the whole collection (*T1 ... Tn*).

$$A = \begin{array}{c|cccc} & T\_1 & T\_2 & \cdots & T\_n \\ \hline D\_1 & a\_{11} & a\_{12} & \cdots & a\_{1n} \\ D\_2 & a\_{21} & a\_{22} & \cdots & a\_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D\_m & a\_{m1} & a\_{m2} & \cdots & a\_{mn} \end{array}$$

*Figure 4.3:* Term-Document matrix *A* 

A technique that builds on matrices such as the term-document matrix is *Latent Semantic Analysis (LSA)* (also called *Latent Semantic Indexing (LSI)* in the information retrieval context). LSA is a mathematical method for computer modeling and simulation of the meaning of words and passages. It analyzes representative corpora of natural text and thereby closely approximates many aspects of human language learning and understanding [99]. LSA analyzes relations between a set of documents and a set of words via concepts that are generated for the terms in the occurrence matrix. It uses *singular value decomposition* to reduce the size of the term-document matrix [75]. Singular Value Decomposition (SVD) is a well-known technique in matrix theory, which became practical for application to such complex problems only after the advent of powerful enough machines and algorithms to exploit them in the late 1980s [99]. As opposed to techniques such as the vector space model, which operate directly on keywords without semantic knowledge ("surface co-occurrence"), SVD promises to approximate many aspects of human language learning and understanding. LSA vectors approximate the meaning of a word as its average effect on the meaning of the documents where it occurs, and reciprocally approximate the meaning of documents as the average of the meaning of their words [99]. Possible applications for LSA are various tasks in information retrieval, such as comparing documents in concept space (clustering and classification) or cross-language information retrieval (finding similar documents across languages), or the detection of relations between terms via the concepts. For more information about LSA see for example the description of LSA from a rather psychological point of view [98], a deeper discussion of its mathematical aspects [119], or an early article describing the general aspects of the method in some detail [49].

#### **4.2.4 Machine Learning Paradigms**

*Machine learning* is a discipline that is concerned with the automatic recognition and detection of certain patterns and regularities within data. The applications are manifold, they encompass natural language processing, machine perception, syntactic pattern recognition, biotechnology, even tasks such as credit card fraud detection or stock market analysis - to name but a few. Besides academia, industry applies machine learning methods extensively in very heterogeneous areas.

Machine learning is a sub-field of artificial intelligence [167]. A definition by Samuel [155] from the early days states that machine learning is "the field of study that gives computers the ability to learn without being explicitly programmed". Mitchell [121] gives a more recent and precise definition in terms of a well-posed learning problem: a computer program is said to learn from an experience *E* with respect to a task *T* and a performance measure *P* if its performance at the task, as measured by *P*, improves with the experience *E*. So machine learning is based on induction from patterns detected in data. Ontology learning often utilizes machine learning approaches, but due to the large extent of the field this section will only include the basic principles needed to understand the work presented in Section 4.3, such as the two main paradigms of *supervised* and *unsupervised* learning.

In *supervised learning*, the system provides labeled training examples including the "correct answer" as input to a learning algorithm. The aim is to train the learning algorithm to give answers for new examples. The input is an n-dimensional feature vector; for example, a system might get features such as the weight and color of an object to predict if the object is a kiwi or an orange. Every input feature corresponds to a dimension. In *classification* tasks the output (the variable predicted) of a system is a discrete value, in *regression* analysis it is a continuous value. So in a classification task the algorithm predicts a target class label (from a set of classes) based on an input feature vector. *Binary classification* is a specialization of the classification task where there are only two target classes.

The prediction algorithm needs a mapping function from ℝⁿ (the feature vector space) to *L* (the target class labels) [37]. The goal is to approximate the mapping function from training examples; the approximation must not be too close (*overfitting*) in order to be able to generalize from training examples to new examples. More precisely, the aim of a classifier is to minimize the empirical risk of misclassification based on a loss function, which quantifies the cost of misclassifying one example from one class as another [37].

When training a classifier one has to be aware of the problems of *overfitting* and *skewed datasets.* To avoid overfitting, classifiers should never be evaluated on training data itself but on test data. The problem of skewed data emerges when some target classes are much more frequent than others. A credit card fraud detection classifier would gain 99.9% accuracy when the output is always "no fraud" - which is definitely not the expected behavior.
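The effect of a skewed dataset on accuracy can be made concrete with a small sketch. The data below is hypothetical (one fraudulent transaction among 1000), and the additional measure, the share of actual fraud cases that are detected, is introduced here only to show what accuracy hides.

```python
# Hypothetical skewed dataset: 0.1% of all transactions are fraudulent
labels = ["fraud"] + ["no fraud"] * 999

# Trivial "classifier" that always predicts the majority class
predictions = ["no fraud"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Share of actual fraud cases that the classifier detects
detected = sum(p == "fraud" and y == "fraud" for p, y in zip(predictions, labels))
fraud_detection_rate = detected / labels.count("fraud")

print(accuracy, fraud_detection_rate)
```

The classifier reaches 99.9% accuracy without detecting a single fraud case, which is why accuracy alone is a misleading measure on skewed data.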

The algorithms of *unsupervised learning* need no explicit training examples as input. The input is just a dataset - for example a natural language text corpus - in which the learning algorithm tries to find interesting structures. Typical examples of unsupervised learning are clustering methods, which detect and exploit frequent and common patterns in data. Other applications range from market segmentation to the detection of galaxies from astronomical data.

*Computational learning theory* is a branch of theoretical computer science concerned with the analysis of machine learning algorithms. On the one hand learning theory helps to estimate the performance of machine learning algorithms. Furthermore, it gives clues such as how many training examples are sufficient for a certain application of a supervised learning algorithm. Computational learning scientists also study the complexity and feasibility of learning. An algorithm is regarded as feasible if it runs in polynomial time.

#### **Supervised Learning Methods**

Supervised learning methods generate a function to map input (feature vectors) to an output. The type of the output variable (discrete vs. continuous) determines if a classification or regression task emerges. Jurafsky and Martin [94] distinguish sequential and non-sequential classification problems. In a sequential classification problem a model is applied that assigns some label to each unit in a sequence. POS tagging is an example of such a problem. Probabilistic sequence classifiers compute a probability distribution over possible labels and choose the best label sequence. Hidden Markov models (see below) are an example of such a probabilistic sequence classifier. Non-sequential classification assigns a class to a single observation based on its features; this includes tasks such as text categorization (e.g. is an email spam or not), and sentiment analysis (does the text fragment express positive or negative opinion) [94]. A probabilistic classifier also gives the probability that an observation is correctly assigned to a class, in fact it gives a probability distribution over all classes.

Imbalanced datasets are a common problem in machine learning. An example was already mentioned with the credit card fraud sample, where almost all transactions are in the class "no fraud". In order to get the desired results from a classifier, techniques such as rebalancing are applied: *Oversampling* replicates some training examples from the minority class, *undersampling* removes some examples from the majority class until the desired distribution is obtained [37]. Rebalancing has to be used with care: oversampling may lead to overfitting, and undersampling removes potentially helpful input. Another way to cope with imbalanced datasets is the use of *cost-sensitive learning*, which assigns relative costs of misclassification to the specific classes. The cost of misclassification for the minority class is typically high. The learning algorithm minimizes total cost.
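The two rebalancing strategies can be sketched with random sampling. The sketch assumes a 1:1 class distribution as the target and a majority class that is at least as large as the minority class; the function names and the transaction data are hypothetical.

```python
import random

def oversample(minority, majority, rng):
    """Replicate randomly chosen minority examples until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def undersample(minority, majority, rng):
    """Keep only a random subset of the majority class, as large as the minority class."""
    return minority, rng.sample(majority, len(minority))

rng = random.Random(42)                             # fixed seed for reproducibility
fraud = [("transaction", i) for i in range(5)]      # hypothetical minority class
no_fraud = [("transaction", i) for i in range(5, 100)]  # hypothetical majority class

over_minority, over_majority = oversample(fraud, no_fraud, rng)
under_minority, under_majority = undersample(fraud, no_fraud, rng)
print(len(over_minority), len(over_majority), len(under_minority), len(under_majority))
```

Oversampling keeps all information but duplicates minority examples (the overfitting risk mentioned above), while undersampling discards most of the majority examples.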

**Bayesian Classification.** *Bayesian classifiers* are statistical classifiers to predict class membership probability [75]. They are based on Bayes' theorem, which states that one conditional probability, for example the probability of a hypothesis given observed evidence, depends on its inverse - in this example the probability of the evidence (E) given the hypothesis (H). In the simple case of only involving discrete distributions the Bayes theorem can be formulated as:

$$P(H|E) = \frac{P(E|H) \cdot P(H)}{P(E)}\tag{4.22}$$

Studies show that simple Bayesian classifiers, also called *naive Bayesian classifiers*, which assume that all features are independent and have the same relevance, are comparable in performance with other methods such as decision trees and neural networks [75]. In contrast to naive Bayesian models, *Bayesian belief networks* are graphical models that allow the representation of dependencies among subsets of features. The training of a Bayesian classifier is performed simply with a set of examples, which consist of a list of features and the respective classification for the example. Naive Bayesian classifiers are simple to learn and to adapt, and it is easy to interpret the learned probabilities of features. However, the assumed independence of features leads to the inability to exploit combinations of features [167].
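A naive Bayesian classifier for token features can be sketched in a few lines. The variant below is a multinomial model with add-one smoothing and log probabilities; both choices, as well as the function names and the toy training data, are assumptions made for illustration and are not prescribed by the text.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features, label) pairs, where features is a list of tokens."""
    class_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)   # label -> feature -> count
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary

def classify(features, model):
    """Return the label maximizing log P(label) + sum of log P(feature | label)."""
    class_counts, feature_counts, vocabulary = model
    total_examples = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_examples)            # prior P(H)
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:                            # likelihoods P(E|H), add-one smoothed
            score += math.log((feature_counts[label][feature] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["cheap", "pills", "offer"], "spam"),
    (["cheap", "offer", "now"], "spam"),
    (["meeting", "agenda", "tomorrow"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training)
print(classify(["cheap", "offer"], model), classify(["meeting", "notes"], model))
```

The evidence term P(E) of Equation 4.22 is omitted because it is identical for all classes and therefore does not affect which class receives the highest score.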

**Hidden Markov Models.** *Hidden Markov Models (HMMs)* are models for sequence classification, and are among the most important learning models in speech and language processing [94], used e.g. for speech and handwriting recognition, or POS tagging. HMMs are based on *Markov chains* (also called *observed Markov models*). Markov chains are special cases of weighted finite-state automata, where every input sequence uniquely determines which states the automaton will go through. A weighted finite-state automaton is defined by a set of states and possible transitions between those states - every such transition (arc) has an associated probability, which creates a transition probability matrix. Such a Markov chain helps to compute the probability of a sequence of events observed in the world. An HMM makes it possible to model both *observed events* and *hidden events* - hidden events cannot be directly observed in the real world. In POS tagging, for example, the observed events are the words in a sentence, and the hidden events correspond to the POS tags. So next to a transition matrix, an HMM includes a sequence of *observation likelihoods*, i.e. the probability of an observation being generated in a distinct state. HMMs are characterized by three fundamental problems [139, 94]: The first one (*likelihood*) is to determine the likelihood of a given observation sequence for a given HMM. The second one (*decoding*) is, given an observation sequence and an HMM, to discover the best hidden state sequence - in POS tagging that is the main problem to be tackled. The final task (*learning*) consists of learning the parameters of an HMM, given an observation sequence and a set of states.
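The decoding problem is typically solved with the Viterbi algorithm, a dynamic programming procedure that the text does not spell out. The sketch below is one possible formulation; the two-tag model and all probabilities are invented for illustration, and the observation sequence is assumed to be non-empty.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                for prev in states
            )
        best.append(column)
    # Backtrack from the most probable final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Hypothetical toy model for POS tagging with two tags
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"fish": 0.7, "swim": 0.3}, "VERB": {"fish": 0.2, "swim": 0.8}}

tags = viterbi(["fish", "swim"], states, start_p, trans_p, emit_p)
print(tags)
```

Here `trans_p` plays the role of the transition probability matrix and `emit_p` holds the observation likelihoods described above.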

**Maximum Entropy Models.** *Maximum entropy (MaxEnt)* models are applicable for non-sequential and sequential classifiers. MaxEnt is a *probabilistic classifier,* it belongs to the family of *exponential* or *log-linear* classifiers. MaxEnt combines input features linearly, i.e. features are weighted and then added up - this sum is used as exponent in a function that determines the probability that observation *x* is in class c. For more detail on MaxEnt models and their background see for example Jurafsky and Martin [94].
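The log-linear form can be illustrated with a short sketch: the weighted feature sum of each class is exponentiated and then normalized over all classes so that the results form a probability distribution. The feature names and weights below are hypothetical; in practice the weights are learned from training data.

```python
import math

def maxent_probabilities(features, weights):
    """features: {feature: value}; weights: {class: {feature: weight}}.

    Returns P(class | features) for every class in weights.
    """
    scores = {
        label: math.exp(sum(w.get(name, 0.0) * value for name, value in features.items()))
        for label, w in weights.items()
    }
    normalizer = sum(scores.values())
    return {label: score / normalizer for label, score in scores.items()}

# Hypothetical weights for a spam classifier
weights = {
    "spam": {"contains_offer": 2.0, "known_sender": -1.5},
    "ham": {"contains_offer": -0.5, "known_sender": 1.0},
}
probabilities = maxent_probabilities({"contains_offer": 1, "known_sender": 0}, weights)
print(probabilities)
```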

**Decision Trees.** *Decision trees* (also called *tree diagrams*) are models used for classification. Decision trees consist of internal nodes and leaves. *Internal nodes* correspond to tests for a distinct feature, the *arcs* reflect the value of a certain feature, which finally leads into *leaves*. Leaf nodes are the classes used for classification. So in order to classify some input observation, starting from the root of the tree, the features of the observation determine the path through the tree - until a leaf is reached. The learning or induction of a decision tree can be done in a greedy manner with a top-down recursive divide-and-conquer algorithm [75, 138]. Starting from the root node, which includes all training samples, an entropy-based measure detects the feature with the most discriminative power, which then separates the samples into classes. The process is recursively applied until all samples are of the same class or there are no more features left.
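One widely used entropy-based measure is information gain, i.e. the reduction of class entropy achieved by splitting on a feature. The sketch below shows a single selection step for categorical features; the choice of information gain, the function names, and the fruit data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the samples on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(group) / len(labels) * entropy(group) for group in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels, features):
    """Feature with the most discriminative power according to information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))

# Hypothetical training samples
rows = [
    {"color": "green", "size": "small"},
    {"color": "green", "size": "small"},
    {"color": "orange", "size": "small"},
    {"color": "orange", "size": "large"},
]
labels = ["kiwi", "kiwi", "orange", "orange"]

root_feature = best_feature(rows, labels, ["color", "size"])
print(root_feature)
```

A full induction algorithm would apply `best_feature` recursively to each resulting subset until the stopping conditions named above are met.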

A big advantage of decision trees is that learned models are easy to interpret by a human, for example to determine which features of the feature set are useful and discriminative. Decision trees are well suited for datasets with a lot of categorical data and numerical data that has breakpoints. For problems with many numerical input features or complicated relations between the features, decision trees are not the best choice [167].

**Kernel Methods.** *Kernel methods* are a class of algorithms for pattern analysis. They allow the study of general types of relations such as clusters, correlations and classifications in general types of data, for example text, sets of points, or images. As a recent development in the field of machine learning algorithms, kernel methods became widely used for relation extraction [7]. Traditionally, theory and algorithms of machine learning have been well developed for the linear case, but real world data and analysis problems often require nonlinear methods in order to detect the kind of dependencies that allow successful prediction of properties of interest [88].

The advantage of kernel methods is that they provide efficient training algorithms (as opposed to multi-layered neural networks for example, which are hard to train) and once trained, they are very fast in classifying new examples. Another strength is the ability to represent complex nonlinear functions [149]. Drawbacks are the need for large datasets in order to produce good accuracy, and the difficulty of interpreting a Support Vector Machine (SVM) [167].

SVMs are a specialization of kernel machines, which are typically applied for binary classification problems [37]. A nonlinear transformation maps the input vector space into some other vector space. The *kernel function*, which needs to be established, then defines the dot product between vectors in that transformed vector space [37]. Subsequently, the goal is to find the maximum-margin hyperplane which splits the training examples into two classes; this hyperplane represents the best discriminator between the two classes. Training examples next to the hyperplane are called support vectors, and only the support vectors are finally needed to define the hyperplane [37]. So in fact a nonlinear problem, or more precisely nonlinear observations, are mapped into a higher-dimensional space, where a linear classifier is applied subsequently to solve the nonlinear problem - this is also called the *kernel trick*.
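The relation between a kernel function and the dot product in the transformed space can be checked numerically. The sketch uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs, chosen here only because its feature map is small enough to write down explicitly; it is not a kernel discussed in the text.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    """Explicit nonlinear map from 2 to 3 dimensions for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def polynomial_kernel(x, y):
    """Computes the dot product in the transformed space without applying phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(y))       # map first, then take the dot product
implicit = polynomial_kernel(x, y)   # kernel evaluated in the input space
print(explicit, implicit)
```

Both computations give the same value, but the kernel never constructs the higher-dimensional vectors, which is what makes the approach efficient for very high-dimensional transformed spaces.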

**Neural Networks.** *Neural networks* are computational models which try to simulate the structure and behavior of aspects of biological neural networks, especially the human brain. Such neural nets connect groups of artificial neurons. Artificial neurons have inputs from other neurons, each with a weighting function attached, and an aggregation function for all inputs. If the input value is above a certain threshold the neuron fires a signal to connected output neurons.
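A single artificial neuron of this kind can be sketched as follows. The weighted sum as aggregation function and the binary output are assumptions; the weights and threshold are hand-picked so that the neuron behaves like a logical AND.

```python
def neuron_output(inputs, weights, threshold):
    """Fire (return 1) if the weighted sum of the inputs exceeds the threshold."""
    aggregated = sum(w * x for w, x in zip(weights, inputs))
    return 1 if aggregated > threshold else 0

# Hand-picked parameters: the neuron only fires if both inputs are active
weights, threshold = [1.0, 1.0], 1.5
outputs = [neuron_output(pair, weights, threshold)
           for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

Training a network amounts to adjusting such weights automatically instead of choosing them by hand.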

There are unsupervised and supervised neural network models. In unsupervised learning the network is provided with input data only and it decides upon the features used for grouping input (e.g. in a clustering task) itself. In supervised learning the network is provided with a labeled training set. Training a neural network for a classification task adjusts the individual connection weights to predict the correct class label. Neural networks in general need long training times [37] and lack interpretability, i.e. it is hard for a human to interpret the connection weights - it is basically a large black box model. Having said that, neural networks can handle complex nonlinear functions [167], stand out in terms of tolerance against noise, and have the ability to classify patterns on which they have not been trained [75].

There are various types of neural networks, such as *feed forward networks* or Kohonen's *self-organizing networks* [97]. For more information about neural networks the reader is referred to seminal work in the field [146, 120] or newer literature such as [16, 80, 51].

#### **Unsupervised Learning Methods**

*Unsupervised learning methods* need no labeled examples; the methods aim at detecting structures in data. Typical applications in data mining and natural language processing are clustering and association rule mining. Clustering assigns a set of similar objects into subsets (clusters). Clustering approaches are divided into *hierarchical* and *non-hierarchical*. Non-hierarchical clustering (also called *flat clustering*) produces a set of groups. Hierarchical methods additionally create a tree structure between those groups. Furthermore, there is commonly a distinction between *hard* and *soft* clustering methods. Hard clustering assigns each object to exactly one cluster, while in soft clustering objects only have a certain degree of membership, i.e. a fractional membership, to a group - an object's assignment is a distribution over all clusters.

**Flat Clustering.** Flat clustering creates a flat set of groups with no explicit structure that relates those groups. A well-known algorithm for flat clustering is *KMeans,* which starts by randomly selecting *k* centroids in the set of objects. The next step assigns every object to the centroid to which it is closest, according to a measure of distance (e.g. Euclidean distance). Recomputing the mean of each resulting cluster then yields the new centroids. The assignment of objects and the recomputation of centroids are repeated until some stopping criterion is reached. KMeans converges to local optima; therefore it is usually run several times with different random initializations. The advantages of KMeans are simplicity and efficiency.
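The KMeans procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`kmeans`, `sse`, `kmeans_restarts`), the stopping criterion (no assignment changes), and the toy data points are assumptions made for this example. The restart wrapper keeps the run with the lowest sum of squared distances, which is one common way to deal with local optima.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k randomly selected objects as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stopping criterion (an assumption): no assignment changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances of all points to their assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def kmeans_restarts(points, k, restarts=10):
    """Run KMeans with several seeds and keep the run with the lowest SSE."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: sse(points, *run))


# Toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans_restarts(points, k=2)
```

On this toy data the first three points should end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.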

**Hierarchical Clustering.** Hierarchical clustering methods produce a hierarchy of clusters. There are generally two types of approaches: *bottom-up (agglomerative)* and *top-down (divisive).* Agglomerative hierarchical clustering starts by creating a cluster for every object. Then, at each stage the two most similar clusters are joined together. Similarity between objects can be measured by various metrics, such as Euclidean distance or Manhattan distance. Additionally, the algorithm needs to compute the similarity between clusters with a technique such as *single linkage, complete linkage* or *average linkage.* For more details see for instance Manning and Schütze [111].
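The agglomerative procedure and the three linkage techniques can be illustrated with the following sketch. It is a naive implementation intended only to make the steps concrete; the function names and the use of Euclidean distance are choices made for this example, and practical implementations use more efficient data structures.

```python
import math
from itertools import combinations


def linkage_distance(a, b, method):
    """Distance between clusters a and b (lists of points) under a linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair of members
        return min(dists)
    if method == "complete":    # farthest pair of members
        return max(dists)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {method}")


def agglomerative(points, n_clusters, method="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]   # start with one cluster per object
    merges = []                        # sequence of merges, i.e. the hierarchy
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], method),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3, method="single")
```

Running the loop until a single cluster remains (`n_clusters=1`) yields the complete merge sequence, from which the full tree can be read off.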

In *top-down* clustering the starting point is one big cluster which contains all objects. Two questions are crucial: how to select the next cluster to split, and how to actually split a cluster into two [37]. A coherence function is a possible way to determine which cluster to split. Another option is to simply select the cluster containing the most objects. The subsequent task of splitting the cluster is basically a clustering task itself, where any clustering algorithm, such as KMeans, is applicable.

**Association Rule Mining.** We discuss this technique in a little more detail, as it is frequently used for the extraction of unlabeled (non-taxonomic) relations from text. *Association (Rule) Mining (ARM)* is the task of finding correlations between items in a dataset [31]. The seminal work on ARM was motivated by the analysis of market basket data, which aimed at a better understanding of consumer purchasing behavior in order to exploit this understanding for better target marketing. Marketers use the results of ARM for optimizing (in terms of revenues for the seller) the placement of products in shops, and for price policy. The goal of ARM is to extract useful or interesting rules from data, rules that are novel, externally significant, unexpected, non-trivial, and actionable [93, 145]. The original idea has been applied in many diverse areas, such as risk analysis in commercial environments, epidemiology, clinical medicine, fluid dynamics, and crime prevention [31].

The *market-basket problem* takes as its starting point a large number of items (such as all the products in a supermarket) and market baskets, each of which includes a subset of those items. This basically creates a sparse matrix, as discussed in the section about the vector space model (Section 4.2.3). When applying ARM methods in information retrieval, the products in a shop are substituted by the term set used in a document collection, and the market baskets by the documents.

ARM exploits the data to find rules of the form *X* → *I<sub>j</sub>*, where *X* is a subset of items from the whole itemset (*I*), and *I<sub>j</sub>* is a single item which is not in *X.* The *confidence* of such a rule equals the probability that *I<sub>j</sub>* is present in a transaction (i.e. basket, document, etc.) given that *X* is present in it. Even more interesting than the confidence of a rule is its *lift,* i.e. the observed confidence related to the confidence expected by chance. For example, *milk, water* → *bread* might have a high confidence because bread is in many market baskets, but the question is whether there is some *causality* *X* → *I<sub>j</sub>*, which means that *X* "causes" *I<sub>j</sub>* to be bought, expressed by a confidence level higher than expected. In most applications only rules about items that frequently occur in transactions are of interest *(frequent itemsets).* The metric *support* for an itemset yields the ratio of transactions in which the itemset is present. In many situations *support thresholds* are applied; for example, a threshold of 0.01 means that the itemset *X* has to be present in at least 1% of all transactions.
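To make the three metrics concrete, the following sketch computes support, confidence and lift over a handful of invented market baskets. The basket contents and the function names are assumptions for illustration only. Lift is computed here as the confidence of the rule divided by the support of the consequent, which is the usual way to relate observed confidence to the confidence expected by chance.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, item, transactions):
    """Estimate of P(item in transaction | all of x in transaction)."""
    return support(set(x) | {item}, transactions) / support(x, transactions)


def lift(x, item, transactions):
    """Observed confidence relative to the confidence expected by chance."""
    return confidence(x, item, transactions) / support({item}, transactions)


# Invented toy baskets; bread occurs in 6 of the 8 baskets.
baskets = [
    {"milk", "water", "bread"},
    {"milk", "bread"},
    {"water", "bread"},
    {"milk", "water", "bread", "beer"},
    {"beer", "bread"},
    {"beer", "chips"},
    {"milk", "water"},
    {"bread"},
]

rule_confidence = confidence({"milk", "water"}, "bread", baskets)
rule_lift = lift({"milk", "water"}, "bread", baskets)
```

In this toy data the rule *milk, water* → *bread* holds in two of the three baskets containing milk and water, yet its lift is below 1, because bread is bought even more often in general than it is together with milk and water.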

The research field of ARM is mature. After the seminal *apriori algorithm* of Agrawal et al. [2], many algorithms were proposed. For the interested reader, Ceglar and Roddick [31] give a well-written survey about ARM fundamentals and the evolution of ARM algorithms.

#### **Toolkits**

There are a number of open source toolkits that support the application of machine learning methods. A very prominent one is Weka [193], which provides a collection of machine learning algorithms for data mining tasks such as data pre-processing, classification, regression, clustering, association rules, and visualization. Another Java-based package is Mallet [117], which includes components for statistical natural language processing, document classification, clustering, information extraction, etc. RapidMiner<sup>2</sup> is a machine learning framework, which supplies open source packages as well as an enterprise product for commercial customers.

<sup>2</sup>http://rapid-i.com

As already mentioned, this section only scratched the surface of machine learning; the interested reader is referred to the extensive information found for example in [121, 181, 79, 17, 51]. Applying many of the techniques discussed in the current section, the next section (Section 4.3) gives an overview of state-of-the-art methods in the learning of ontological relations.

## **4.3 Literature Review**

This section presents the related literature for the research field of this thesis, i.e. learning non-taxonomic relations in ontologies. It is hard to group the state of the art into categories, as the work can be classified along at least two dimensions: the type of underlying data used, and the methods applied. Among the main types of data used are domain-specific natural language text, the Web, and online semantic information (ontologies, etc.). Roughly, the methods can be divided into methods exploiting semantic association with statistical and machine learning techniques, linguistic methods relying on textual patterns, and methods based upon reasoning on Semantic Web information. Many approaches are not restricted to a single type of data or method; for example, Schutz and Buitelaar [164] apply statistical and linguistic methods to domain text. In the following, the approaches are grouped based on their major characteristics. Table 4.2 provides an overview of the classification schema.


*Table 4.2:* Classification schema for related literature based on the main type of input data used and the methods applied

In addition, some of the related work hardly fits into the given classes due to its specific type of environment and goal (Section 4.3.6) or because of its focus on a specialized type of relation (Section 4.3.7).

The classification of related literature results in the following categories:

1. Section 4.3.1: Work that has domain text as major source of input and exploits semantic associations of various features with techniques from machine learning and corpus statistics, such as association rules, co-occurrence statistics, kernel machines, and clustering methods.


#### **4.3.1 Domain Text and Semantic Associations**

The first step in learning non-taxonomic relations is usually to detect unlabeled relations between concepts. For this task many authors exploit Harris' distributional hypothesis [77], applied for example by Liu et al. [105] in co-occurrence analysis combined with spreading activation to detect unnamed relations (see Section 4.4). Applying unsupervised machine learning, Maedche and Staab [107] discover non-taxonomic relations by the adoption of association rules (see Section 4.2.4). They define transactions in terms of the words occurring together in certain syntactic dependencies, which are then used as input to a generalized association rules algorithm. Their method also covers the handling of relations between instances of the same concept (for example two instances of the concept *person* that cooperate with each other). In addition to finding unnamed relations, the approach also detects the appropriate level of abstraction of the involved concepts with respect to a given concept taxonomy. Maedche and Staab evaluate their method against a gold standard ontology (an *a priori* evaluation, see Section 5.5.5).

Yamaguchi [198] applies Schütze's word space model [165] to extract similar terms and suggest potential relations. He uses a 4-gram (four word) window to find related words, and then applies the cosine measure (see Equation 4.19) to compute the similarities. If the similarity is above a certain threshold, the system suggests a relation between the involved terms. Heyer et al. [86] rely on collocations in large text corpora to extract unnamed semantic relations between concepts. They suggest that certain properties of a relation, such as symmetry, anti-symmetry or transitivity, can be detected from the organization of collocations. They also propose *second-order collocations,* i.e. collocations of collocations, in an iterative process, arguing that higher-order collocations lead to more homogeneous classes. Ciaramita et al. [34] present an unsupervised method for learning arbitrary relations between concepts of a molecular biology ontology. They learn relations between named entities from the Genia<sup>3</sup> corpus with standard natural language processing techniques (a statistical dependency parser). They also generalize the relations found with respect to the Genia ontology, where they evaluate whether using a hypernym instead of the hyponym leads to significantly different probabilities (relying on an approach of Clark and Weir [43]). Byrd and Ravin [25] derive unnamed relations between salient concepts from a document collection by calculating the (normalized) mutual information between concept pairs. On a corpus from the biomedical domain, Reinberger and Spyns [141] employ statistical methods based on frequency information over linguistic dependencies to discover unnamed relations.
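The general idea behind the window-based approach at the beginning of this paragraph (co-occurrence vectors from a four-word window, cosine similarity, and a threshold) can be sketched as follows. This is a simplified illustration and not a reconstruction of Yamaguchi's system or of Schütze's word space model: the vector construction, the function names and the threshold value of 0.5 are assumptions.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(tokens, window=4):
    """For each word, count the words that share a sliding window with it.

    Overlapping windows are counted separately, so nearby words get more weight.
    """
    vectors = defaultdict(Counter)
    for start in range(len(tokens) - window + 1):
        gram = tokens[start:start + window]
        for i, word in enumerate(gram):
            for j, other in enumerate(gram):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def suggest_relations(tokens, terms, threshold=0.5, window=4):
    """Suggest term pairs whose context vectors are at least `threshold` similar."""
    vectors = cooccurrence_vectors(tokens, window)
    suggestions = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            similarity = cosine(vectors[a], vectors[b])
            if similarity >= threshold:
                suggestions.append((a, b, similarity))
    return suggestions
```

Calling `suggest_relations(tokens, terms)` with a tokenized corpus and a list of domain terms returns candidate pairs together with their similarity; the pairs remain unlabeled at this stage.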

Kavalec and Svatek [95] present a method to label otherwise anonymous (non-taxonomic) relations between concepts as an extension of the *Text-to-Onto*<sup>4</sup> ontology learning framework [109]. This unsupervised method extracts relevant lexical entities (verbs or verb phrases) frequently occurring with concept associations. They introduce the *above expectation* heuristic to measure the association between verbs and concepts as the ratio of observed joint frequencies to the frequencies expected under the assumption of independence. The authors evaluate the quality of labels in the tourism domain (Lonely Planet)<sup>5</sup> and on semantically tagged corpora (SemCor)<sup>6</sup> in an *a priori* evaluation against a gold standard. They also involve domain experts to evaluate the correctness of divergent relation types. An important issue raised in Kavalec and Svatek [95] is the problem of directly mapping co-occurrences (e.g. co-occurring verbs) to "deep" ontological relations, as the verbs often also occur in a larger semantic context.
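One possible reading of such an observed-versus-expected ratio is sketched below: for the occurrences of a given verb, the observed joint probability of two concepts is divided by the product of their individual probabilities. The exact formulation in [95] may differ; the data layout (one pair of verb and concept set per verb occurrence), the function name and the toy observations are assumptions made for this example.

```python
def above_expectation(verb, c1, c2, observations):
    """Ratio of observed to expected co-occurrence of c1 and c2 around `verb`.

    observations: list of (verb, set_of_concepts) pairs, one per verb occurrence.
    A value above 1 means the concepts co-occur with the verb more often than
    expected if they were independent.
    """
    contexts = [concepts for v, concepts in observations if v == verb]
    if not contexts:
        return 0.0
    n = len(contexts)
    p_c1 = sum(c1 in cs for cs in contexts) / n
    p_c2 = sum(c2 in cs for cs in contexts) / n
    p_joint = sum(c1 in cs and c2 in cs for cs in contexts) / n
    expected = p_c1 * p_c2  # joint probability under independence
    return p_joint / expected if expected else 0.0


# Invented observations: concepts found near each occurrence of a verb.
observations = [
    ("offer", {"hotel", "room"}),
    ("offer", {"hotel", "room"}),
    ("offer", {"city"}),
    ("offer", {"beach"}),
    ("visit", {"city", "beach"}),
]

ae = above_expectation("offer", "hotel", "room", observations)
```

In the toy data *hotel* and *room* each appear in half of the contexts of *offer* and always together, so the ratio is 2: they co-occur twice as often as independence would predict, which makes *offer* a plausible label candidate for that concept pair.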

<sup>3</sup>http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi

<sup>4</sup>http://sourceforge.net/projects/texttoonto

<sup>5</sup>http://www.lonelyplanet.com

<sup>6</sup>http://www.cs.unt.edu/~rada/downloads.html#semcor

*RelExt*, as described by Schutz and Buitelaar [164], provides a tool for relation extraction in the context of ontology extension. They build on the common idea that verbs express the relation between two concepts, and specify domain and range. This idea is associated with the work on selectional preferences for verb arguments [188]. RelExt extracts relevant verbs and their grammatical arguments from domain-specific text and computes corresponding relations with a combination of statistical and linguistic processing. More precisely, in a first step highly domain-relevant head nouns and verbs are extracted, then the algorithm computes selectional preferences for the verbs. Finally, that information is used to construct triples. In essence, the most relevant verbs are chosen as relation labels.

Gamallo et al. [63] present a corpus-based approach to automatically extract semantic relations between words. In a first step, syntactic dependencies are automatically classified according to their selectional restrictions, thereby creating semantic groups. Furthermore, they detect groups of nouns according to their distribution in detected selectional restrictions. Interpretation rules help to learn the specific semantic relations underlying syntactically related words, i.e. the interpretation rules provide a mapping from the syntactic level to relations in a semantic space.

Cimiano [37] proposes a method for learning relations from corpora based on verbal expressions that follows the tradition of the already mentioned work of Gamallo et al. [63], Schutz and Buitelaar [164] and Ciaramita et al. [34]. The main focus in this approach is the generalization of the arguments of a relation with respect to a taxonomy. An example illustrates this: instead of *work\_for(woman,store)* or *work\_for(employee,institute)* the most general signature, in this case possibly *work\_for(person,organization),* is of most interest. Various statistical measures, namely conditional probability, χ² and Pointwise Mutual Information (PMI), are evaluated for their ability to find the correct level of generalization on the Genia corpus and the Genia ontology. Conditional probability outperforms the other measures in these experiments.

Ciaramita et al. [35] present an unsupervised approach for learning arbitrary relations between annotated named entities in the molecular biology domain using the Genia ontology and the Genia corpus; the approach is also applicable to other domains. The method relies on dependency structures generated by a constituent syntactic parser [22] for the extraction of relation candidates. A χ² test which compares the observed and expected frequencies helps to select relations for ordered pairs of named entities from the list of candidates. A manual evaluation of the method is included. Rinaldi et al. [144] describe an environment to extract domain-specific relational information, with experiments based on an extended version of the richly annotated Genia corpus. For this task they apply deep-linguistic parsing and manually created patterns as well as ontological constraints. In an unsupervised method to learn ontologies from scratch, Reinberger et al. [142] apply shallow parsing to select functional relations from the syntactic structure subject-verb-direct-object. Clustering then builds semantic classes of terms sharing a certain relation. They applied and evaluated the approach on two domain corpora, one from the medical domain (SwissProt) and a small legal corpus.

Poesio et al. [132] present a supervised approach that learns feature norms for given concepts, that is relation types that intend to "provide insights into mental representation of concepts". These feature norms are related to *qualia structures,* and are manually compiled for subjects aiming to find the most important properties for a set of concepts. The authors evaluate relations such as *external surface property, origin* or *function.* They applied kernel methods for the learning process, combining a global and a local kernel function with mostly linguistic features, and used SVM as learning algorithm.

Zelenko et al. [202] leverage kernel methods to extract relations from unstructured natural language text. The kernels are defined over shallow parse representations of text. With the help of SVMs and Voted Perceptron as learning algorithms, they extract the specific relations *person-affiliation*  and *organization-location.* An evaluation of the method comparing it with feature-based learning algorithms shows promising results.

#### **4.3.2 The Web and Semantic Associations**

Wong et al. [196] propose a method for acquiring semantic relations for the construction of lightweight ontologies which uses only Web resources (Wikipedia and search engines) as input. Their approach includes two phases, namely *term mapping* and *term resolution.* In the mapping phase Wikipedia mappings yield connections between input terms. The main contribution is the resolution phase, which comprises *lexical simplification, word disambiguation* and *association inference.* Lexical simplification reduces the lexical complexity of composite terms in order to be able to find mappings in Wikipedia. Mutual information between constituents of an input term, calculated with Google page count statistics, guides the lexical simplification process, resulting in appropriate subphrases. Word disambiguation aims at finding correct senses for ambiguous terms by virtue of the senses' relatedness to the already mapped terms. In the association inference step cluster analysis is applied to terms labeled as *non-existent* during the mapping phase, which means that those terms have no lexical matches in Wikipedia. The authors propose a term clustering algorithm with featureless similarity measures known as *Tree-Traversing Ant* [195] to generate potential associations.

Jiang et al. [90] present a knowledge-rich method for the mining of generalized associations of semantic relations, based on textual content from the Web. As opposed to classical text mining methods, which transform the input textual content into simplistic intermediate representations, such as bags of words or word vectors, the authors aim at an intermediate representation that can express semantic relations between the concepts found in text. For this purpose they use RDF (see Section 3.2.1), which enables the representation of text as simplified conceptual graphs. After applying NLP tools such as part-of-speech tagging, a set of predefined syntactic patterns is used to extract semantic relations, which are encoded as RDF statements. Additionally, a term taxonomy is generated on-the-fly with WordNet and domain-specific lexicons. As traditional association rule mining on the extracted RDF statements suffers from data sparseness (relations are seldom repeated in many documents), and some appropriate kind of generalization is needed, the authors propose a novel generalized association pattern mining algorithm (*GP-Close*) to find the proper level of abstraction and labeling.

#### **4.3.3 Domain Text and Linguistic Patterns**

Many authors have applied handcrafted patterns in the tradition of Hearst to natural language text for various tasks, for example anaphora resolution [133], or in specialized environments, for example the extraction of relations in texts surrounding images [3]. Berland and Charniak [14] adopted Hearst patterns for the identification of meronyms *(part-of* relations).

Various other approaches to learn specific relation types based on linguistic patterns are listed in Sanchez and Moreno [156]. Yamada and Baldwin [197] discover *telic* and *agentive* roles for nouns from text data as parts of qualia structures, where the telic role represents a typical purpose of the entity and the agentive role represents the origin of the entity. They rely on certain lexico-syntactic patterns as well as maximum entropy model classifiers. Girju and Moldovan [67] present a semi-automatic method to discover generally applicable lexico-syntactic patterns that refer to the *causal* relation. Poesio and Almuhareb [131] present a method for determining combinations of some of these relation types. This type of handcrafted pattern works well for specific relation types in a given domain, but is restricted to certain relations and domains, as the cost of adapting patterns can be too high [122].

Byrd and Ravin [25] extract salient concepts from document collections, and unnamed (see above) and named relations between them. They extract named relations with certain grammatical patterns using specially-built finite state automata, for example "Gerstner, the CEO of IBM, ... " results in the relation triple <Gerstner:CEO:IBM>. Filtering the output, for example with selectional restrictions facilitated by named entity recognition, helps to improve the results.
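The appositive pattern from this example can be approximated with a regular expression. The sketch below is only a stand-in for the specially-built finite state automata of [25]: the pattern, the function name, and the use of capitalization as a crude substitute for named entity recognition are assumptions made for illustration.

```python
import re

# Appositive pattern "<Name>, the <role> of <Organization>". Capitalized tokens
# stand in for the named entity recognition step (a simplifying assumption).
APPOSITION = re.compile(
    r"(?P<person>[A-Z][\w-]*(?: [A-Z][\w-]*)*), the (?P<role>[\w ]+?) of "
    r"(?P<org>[A-Z][\w&-]*(?: [A-Z][\w&-]*)*)"
)


def extract_named_relations(text):
    """Return (subject, relation label, object) triples found in text."""
    return [(m["person"], m["role"], m["org"]) for m in APPOSITION.finditer(text)]


triples = extract_named_relations("Gerstner, the CEO of IBM, announced new results.")
```

A pattern this simple also fires on capitalized sentence-initial words that are not names, which is exactly the kind of noise that output filtering with named entity recognition is meant to remove.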

Alfonseca et al. [7] present methods and algorithms to improve the precision of rote extractors, which are a common method to extract non-taxonomic relation instances (see Section 4.2.2). Their evaluation shows that precision values are lower than expected for many patterns learned by traditional rote extractors, especially for ambiguous patterns, which are therefore filtered out subsequently. Ruiz-Casado et al. [148] apply these improved rote extractors aiming at semi-automated semantic annotation of Wikipedia. Based on a given set of relations they associate Wikipedia entries and argue that, although automatic methods in the field of natural language processing (NLP) typically produce some mistakes, correcting those mistakes takes less effort than annotating the relations from scratch. Their method starts with a seed list of training examples per relation type, and extracts sentences from a Wikipedia corpus with NLP tools. The corpus itself is created by recursively crawling a part of Wikipedia from some starting points. The patterns found in text are then generalised in order to raise recall and pruned to improve precision. They evaluate the approach with eight predefined relations such as *person's birth year, actor-film* or *player-team.* The measured precision ranges from values >74% for *person's birth* to below 10% for *player-team.* Ruiz-Casado et al. attribute this to the fact that some relations often appear with fixed and unambiguous patterns, whereas other relation types use more general and ambiguous patterns.

Chagnoux et al. [33] extend the idea of automatically learning new patterns for given relation types. Their system integrates new relations found in external ontologies, and automatically learns patterns representing the new relations, thereby iteratively extending the pattern base. The architecture is not completely automatic, however: for all new patterns and relations they enforce a step of manual validation to ensure correctness and relevance.

#### **4.3.4 The Web and Linguistic Patterns**

Etzioni et al. [53] use Hearst-style patterns applied *to the Web* as part of *KnowItAll,* a system that aims to extract large collections of facts from the Web autonomously, domain-independently, and in a scalable manner. Markert et al. [114] apply shallow patterns to the Web for nominal anaphora resolution. Cederberg and Widdows [30] show that the precision of Hearst patterns can be improved by filtering the results of patterns with Latent Semantic Indexing (see Section 4.2.3). They assume that hyponyms and hypernyms are distributionally similar, and filter pairs below a certain threshold, resulting in a reduction in the rate of error by 30%. To increase the recall of Hearst patterns they apply a graph-based model of noun-noun similarity which was learned automatically from coordination patterns, and report a five-fold increase in the number of correct hyponymy relations extracted.

Section 4.2.2 already described the idea of Open Information Extraction: it is based on large-scale (to the point of Web scale) extraction of relational data, independent of the type of relation. TextRunner [9] is an implementation of the Open IE paradigm on the basis of natural-language text. To figure out whether there is a general model of how relations are expressed in English text, the authors manually examined 500 random sentences from an IE corpus, and came to the result that most relations are indeed expressed with a compact set of relation-independent patterns. These patterns are listed in [52], and include very simple ones such as "E1 Verb E2". Detailed additional contextual clues are necessary to decide whether there really is a relation between the two entities occurring with the verb. The original version of TextRunner, presented by Banko et al. [9], used a "Naive Bayes Classifier to predict whether heuristically-chosen tokens between two entities indicated a relation or not" [10, p 32]. This classifier was then replaced with a graphical model called a conditional random field (CRF), which, given a set of input observations, maximizes the conditional probability of a finite set of labels. With the CRF the extractor learns to label each word in a sentence by annotating the beginning and end both of entity names and relation strings. Among the features used in the model are regular expressions, part-of-speech tags, context words, etc. For more details on the model and its characteristics see Banko and Etzioni [10]. After training the model, TextRunner can be run on a corpus in linear time and extracts triples trying to capture the relations existing in the sentence. Many additional modules such as synonym detection help to improve the quality of extracted relations, or to make them accessible, e.g. by indexing them with Lucene.<sup>7</sup> The system has various applications, for example *question answering, opinion mining* and *fact checking* [52].
Compared to traditional information extraction, Open IE offers higher levels of precision, at the expense of recall. Open IE should be preferred when the relation labels are not known in advance, new relations should be discovered, or their number is massive. Banko and Etzioni [10] also present and evaluate a hybrid extraction approach combining traditional and Open IE.

<sup>7</sup>http://lucene.apache.org

WEBTABLES [26] applies the Open IE paradigm to the extraction of relations from structured data, more precisely from HTML tables on the Web. The approach aims at generating relational data by exploiting the implicit structure of the HTML table tag. Cafarella et al. [27] estimate that only a minor percentage (1.1%) of HTML tables on the Web actually contain relational data; the rest is used for page layout etc. The main challenge is to distinguish *relational* from *non-relational* tables automatically. WEBTABLES applies a two-step procedure: step 1 throws away tables that are obviously not relational. In step 2 a statistical classifier distinguishes *relational* from *non-relational* tables based on a set of hand-written features, such as the number of rows, the number of columns, the number of columns with numeric data, etc. The advantage of exploiting tables is that they contain a large number of facts structured in a way that makes it easy to detect the involved terms and their relations.
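The kind of hand-written features mentioned above can be illustrated with a short sketch. It computes the three features named in the text for a table given as a list of rows, and adds a crude first-step filter. The function names, the treatment of the first row as a header, and the rule that tables with fewer than two rows or columns are "obviously not relational" are assumptions for this example and are not taken from WEBTABLES; the statistical classifier of step 2 is not shown.

```python
def is_number(cell):
    """True if the cell parses as a number (thousands separators allowed)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def table_features(rows):
    """Compute simple features for a table given as a list of rows of cell strings."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    numeric_cols = 0
    for c in range(n_cols):
        # Skip the first row, which is assumed to hold the column headers.
        cells = [r[c] for r in rows[1:] if c < len(r) and r[c].strip()]
        if cells and all(is_number(x) for x in cells):
            numeric_cols += 1
    return {"rows": n_rows, "cols": n_cols, "numeric_cols": numeric_cols}


def obviously_non_relational(features):
    """Step 1 filter (hypothetical rule): discard tables too small to hold relations."""
    return features["rows"] < 2 or features["cols"] < 2


table = [["City", "Population"], ["Vienna", "1,900,000"], ["Graz", "290,000"]]
features = table_features(table)
```

The resulting feature dictionary is the sort of input a statistical classifier in step 2 could be trained on, given tables labeled as relational or non-relational.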

The so-called "Deep Web" refers to content on the Web only accessible through forms, and therefore usually hidden from search engine crawlers. Cafarella et al. [27] propose a method to surface that hidden information into Web pages that can be indexed by search engines. In the spirit of Open IE, that method should be efficient and scalable. The major obstacle is to pre-compute the form submissions for any given form in order to surface a large part of the underlying database. Cafarella et al. [27] propose heuristics such as using keywords extracted from the page and iteratively from the result set, as well as using libraries of types for typed text boxes (e.g. US zip codes).

Sanchez and Moreno [156] present an unsupervised approach using verbs from sentences containing domain concepts and search engine queries in the process of learning non-taxonomic relations. This method combines a pattern/rule-based approach with the intensive use of Web statistics. We describe it in some more detail, because the approach also includes and exemplifies the use of Web statistics, which some researchers in the field of ontology learning have applied quite successfully in the last few years.

The Web, due to its huge size and heterogeneity, can be assumed to approximate the real distribution of information of mankind [36]. Relying on the Web is a way to tackle the *sparse data problem* [96]. Although individual Web resources are considered untrustworthy, redundancy of information on different sites can represent a measure of relevance and trustworthiness [42]. Keyword-based search engines such as Google or Yahoo provide statistics about the information distribution on the whole Web. These statistics about the presence of a certain query term can be computed efficiently from the estimated number of returned results. Turney [183] presents several heuristics to leverage statistics provided by search engines, for example forms of *pointwise mutual information* (see Section 4.2.3). These measures use hit counts to calculate the degree of relation between two query terms, for example with a query between an initial word (*term1*) and a related concept (*term2*):

$$Score(term_1, term_2) = \frac{hits(term_1 \text{ AND } term_2)}{hits(term_2)}$$

Such statistics can be retrieved in a very efficient and almost immediate manner, avoiding the analysis of large corpora, and they provide very robust measures as they are obtained from the whole Web.
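As a small illustration, the score can be computed from any function that returns hit counts for a query string. The sketch below uses an invented lookup table in place of a real search engine; the counts, the `toy_hits` helper and the query syntax `"a AND b"` are assumptions for this example only.

```python
def score(term1, term2, hits):
    """hits(term1 AND term2) / hits(term2), where hits maps a query to a page count."""
    denominator = hits(term2)
    if denominator == 0:
        return 0.0
    return hits(f"{term1} AND {term2}") / denominator


# Invented page counts standing in for a search engine (not real statistics).
TOY_HITS = {"hotel": 2_000_000, "offer AND hotel": 40_000, "sing AND hotel": 200}


def toy_hits(query):
    """Look up the hit count of a query; unknown queries count as zero hits."""
    return TOY_HITS.get(query, 0)


related = score("offer", "hotel", toy_hits)
unrelated = score("sing", "hotel", toy_hits)
```

With these invented counts the first term pair scores two orders of magnitude higher than the second, which is the kind of contrast such a measure is meant to expose.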

Returning to the work of Sanchez and Moreno, the authors start the process of learning non-taxonomic relations with the extraction of verbs including prepositions from sentences that contain concept pairs, more precisely a concept and the hyponym of the concept, from an (evolving) taxonomy built with their ontology learning system. They apply some linguistic filtering rules to raise the quality of the resulting verbs [156]. The verb candidates are then tested for domain relatedness with a query that adopts the Web statistics formula presented above:

$$Score(verb, domainKeyword) = \frac{hits(verb \text{ AND } domainKeyword)}{hits(domainKeyword)}$$

A selection threshold controls which verbs are considered domain-specific; empirically, a value of 1E-3 to 1E-5 appears suitable. Those verbs are the labels for the new, domain-specific relations. Search engine queries with concepts and their respective verbs return a corpus of sentences. A very strict linguistic pattern extracts candidate terms for a non-taxonomic relation with the original concept and verb. A search engine query similar to the one just presented tests those concept candidates for domain relevance. In the ontology learning process the learned non-taxonomic relations are inherited by all subclasses of concepts, which also saves computational resources, as selected or rejected verbs need not be examined again for subclasses. The method of Sanchez and Moreno has some interesting features: it is completely unsupervised, so it avoids the need for a human expert, it is a domain-independent solution, and the learned ontologies can be dynamically adapted and extended to reflect evolving domain knowledge. One of the downsides is that the learned relation labels have no further semantic properties, i.e. they cannot be used for inference, and the current implementation lacks capabilities to detect synonyms, inverses, etc.

#### **4.3.5 Semantic Web Data and Reasoning**

This section introduces related work that exploits structured data present in the current Semantic Web to support ontology learning tasks, especially to detect relations between concepts. Harvesting the Semantic Web, i.e. automatically finding and exploring online knowledge sources, has been a novel trend in the last few years - favored by the recent growth of online semantic data and the building of gateways to access these data [151].

Alani [4] proposes a method for ontology construction by cutting and pasting ontology modules from online ontologies. He proposes a five-step system architecture for ontology construction: (i) identify ontologies relevant to a keyword search via Semantic Web search gateways; (ii) rank these ontologies for relevance; (iii) segment ontologies to extract relevant parts; (iv) merge those parts with ontology merging/mapping algorithms; (v) evaluate the constructed ontology to ensure a certain level of quality.

Another interesting method, which is not directly related to the learning of non-taxonomic relations, but could potentially be adapted to disambiguate and enrich relation labels, is presented by Garcia et al. [71]. Their unsupervised approach dynamically uses online ontologies for word-sense disambiguation of input keywords. The knowledge represented by a pool of ontologies available on the Web yields possible senses for the input keywords. The algorithm then combines the information from the Semantic Web with Google-based frequencies to select the right senses.

Scarlet [152]<sup>8</sup> provides a technique for discovering relations between two concepts by harvesting the Semantic Web. We present this approach in more detail, because it is a part of the method described in the present thesis. Scarlet automatically selects and exploits online ontologies to discover relations between two input concepts. In a simple example, given the concept labels *Researcher* and *AcademicStaff*, Scarlet identifies online ontologies to determine how the two concepts are related at run-time, and combines the information to infer the relation, e.g. Researcher ⊑ AcademicStaff. Originally Scarlet was restricted to *subClassOf* (⊑) and *disjointWith* (⊥) relations, but it has been extended to include *named relations* as well. Scarlet was initially built for the task of ontology matching, where it delivered background knowledge from the Semantic Web to the matcher [150], but the component and its functionality can be integrated into third-party systems as well. Various parameters help to regulate the run-time performance and accuracy of Scarlet.

Relation discovery with Scarlet anchors the given input concept labels in online ontologies (*A* and *B* are anchored as *A'* and *B'*). There are basically two strategies. Strategy S1 consists of finding ontologies that contain both concepts *A* and *B*. The system extracts the relations from those ontologies, combines them in a way set by the given parameters (see below), and returns the result to the caller. If strategy S1 fails, then strategy S2 can be applied. S2 uses multiple ontologies to extract relations in a recursive fashion: concepts related to *A* are extracted from one ontology, and then concepts related to *B* from another ontology. The concepts related to *A* include the parent concepts and subclasses of *A*. For example, to detect a relation between *cabbage* and *meat*, one ontology might state that Cabbage ⊑ Vegetable, and another that Vegetable ⊥ Meat, resulting in Cabbage ⊥ Meat. Figure 4.4 gives a graphical impression of S1 and S2 [152].

*Figure 4.4:* Relation discovery within one ontology (S1) and across ontologies (S2), from Sabou et al. [152]

Scarlet supplies a set of parameters to customize its behavior. As mentioned above, the caller can decide whether to use strategy S1 or S2, which has a significant impact on the run-time of a query. Furthermore, the *number of derived relations* is configurable. The options range from just returning the first found match, with a higher risk that the information is inaccurate, to returning all found matches, and potentially combining them, which is computationally more expensive. The methods to combine relations if various matches are found range from returning all matches unaggregated, to returning the most frequent relation type, to returning a type only if all relations are the same. Finally, for strategy S2, the depth of search in ontology hierarchies determines if only the direct parent and subclass of a concept are considered (depth = 1), or deeper levels as well (depth = *n*).
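The two strategies and one of the aggregation options can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not Scarlet's actual implementation or API: the ontologies are hard-coded sets of triples, anchoring is reduced to exact string matching, S2 is limited to depth 1, and all function names are hypothetical.

```python
from collections import Counter

# Toy "online ontologies": each is a set of (subject, relation, object) triples.
ONTOLOGIES = {
    "onto_food": {("Cabbage", "subClassOf", "Vegetable")},
    "onto_diet": {("Vegetable", "disjointWith", "Meat")},
}


def strategy_s1(a, b, ontologies):
    """S1: collect relations stated between a and b within a single ontology."""
    return [rel for triples in ontologies.values()
            for (s, rel, o) in triples if (s, o) == (a, b)]


def strategy_s2(a, b, ontologies):
    """S2 (depth 1): find a parent p of a in one ontology, then look up the
    relation between p and b in any ontology."""
    found = []
    for triples in ontologies.values():
        parents = [o for (s, rel, o) in triples
                   if s == a and rel == "subClassOf"]
        for parent in parents:
            # Sound for subClassOf and disjointWith, which a subclass inherits
            # from its parent; named relations would need extra care.
            found.extend(strategy_s1(parent, b, ontologies))
    return found


def discover(a, b, ontologies):
    """Try S1 first and fall back to S2 if S1 finds nothing."""
    return strategy_s1(a, b, ontologies) or strategy_s2(a, b, ontologies)


def most_frequent(relations):
    """Aggregation option: return the most frequent relation type, if any."""
    return Counter(relations).most_common(1)[0][0] if relations else None


relation = most_frequent(discover("Cabbage", "Meat", ONTOLOGIES))
```

For the cabbage and meat example, S1 finds no single ontology containing both concepts, so S2 combines the two toy ontologies and yields *disjointWith*.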

Scarlet is closely related to the Watson Semantic Web search engine [48]<sup>9</sup>, which serves as ontology retrieval backend, although Swoogle can be used as backend, too. Watson collects and indexes semantic information found on the Web, and provides a variety of access mechanisms; the goal is to support the building of new kinds of applications that benefit from the Semantic Web. Via its plugin<sup>10</sup>, Watson helps ontology engineers to edit an ontology, suggesting additional statements for classes found in online ontologies. The plugin is available for the NeOn Toolkit<sup>11</sup> and for Protégé<sup>12</sup>. Evolva [201] integrates Watson and Scarlet in an ontology evolution system, where Scarlet is applied to retrieve relations between existing and newly added concepts in the evolving ontology.

Aleksovski et al. [6] use an idea similar to Scarlet. They extract relations between terms by looking for relations between their anchored concepts in background knowledge. That background knowledge is a rich domain ontology, and finding relations means using a reasoning service to exploit the structure of the background knowledge ontology. This approach depends on the availability of a suitable, i.e. large and rich, domain ontology appropriate for the task at hand.

The goal of the DBpedia Relationship Finder (RF) of Lehmann et al. [102] is to provide a user interface to explore the DBpedia dataset by giving a way to find connections between different objects. RF uses structured data, that is RDF triples, from the DBpedia infoboxes managed by a triple store, and lets users query the data by entering two objects which are described by Wikipedia articles. It yields a number of labeled connections between the two input objects, including all the intermediary objects connecting the two. A path from *Object1* to *Object2* therefore typically includes a number of different relation (property) labels. Parameters such as the maximum number of results, the maximum distance, and a blacklist of objects or properties which are not allowed in the connection help to fine-tune the query. The method that has been successfully applied to DBpedia is also applicable to other RDF graphs. In a preprocessing step the RF detected subgraphs in the DBpedia dataset, and determined that the DBpedia graph is very dense (almost all objects have a distance of five to nine from a random start object). If adopted in the learning of non-taxonomic relations, the approach has some shortcomings: Firstly, the objects and relations are not domain-specific, as all DBpedia data is included. Secondly, as the system returns a number of paths between two objects, and those paths each include a number of intermediary objects, it is non-trivial to agree on a relation label between the two input objects.

**<sup>9</sup>http://watson.kmi.open.ac.uk/WatsonWUI**

**<sup>10</sup>http://watson.kmi.open.ac.uk/editor\_plugins.html**

**<sup>11</sup>http://www.neon-toolkit.org**

**<sup>12</sup>http://protege.stanford.edu**
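The path-based view of the Relationship Finder, and the reason why it does not directly yield a single relation label, can be illustrated with a small breadth-first search. This is a sketch under assumptions: the triples are hand-written for illustration rather than retrieved from DBpedia, edges are traversed in both directions, and the function name and parameters are hypothetical.

```python
from collections import deque

# Hand-written triples for illustration; not retrieved from DBpedia.
TRIPLES = [
    ("Vienna", "country", "Austria"),
    ("Mozart", "deathPlace", "Vienna"),
    ("Salzburg", "country", "Austria"),
    ("Mozart", "birthPlace", "Salzburg"),
]


def find_path(start, goal, triples, max_distance=3, blacklist=frozenset()):
    """Breadth-first search for a shortest labeled path between two objects.

    Blacklisted objects or properties are skipped. Returns a list of
    (subject, property, object) triples, or None if no path of at most
    max_distance edges exists.
    """
    neighbours = {}
    for s, p, o in triples:
        if s in blacklist or p in blacklist or o in blacklist:
            continue
        neighbours.setdefault(s, []).append(((s, p, o), o))
        neighbours.setdefault(o, []).append(((s, p, o), s))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_distance:
            continue
        for triple, nxt in neighbours.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [triple]))
    return None


path = find_path("Mozart", "Austria", TRIPLES)
labels = [prop for (_, prop, _) in path]
```

Even in this tiny graph the connection between the two objects carries two different property labels, and blacklisting the intermediary object changes the first of them, which illustrates why agreeing on one relation label is non-trivial.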

#### **4.3.6 Selected Work from SemEval2007**

The Fourth International Workshop on Semantic Evaluations (SemEval 2007, previously known as SensEval) hosted a competition including the task "Classification of Semantic Relationships between Nominals" [68], which is also relevant for non-taxonomic relation learning. 14 teams, with 15 systems, participated in the challenge, giving a good overview and evaluation of methods currently used in that area. The task consisted of classifying semantic relations between simple nominals other than named entities (nouns or noun phrases, e.g. *honey* and *bee*). The competition provided a benchmark dataset, which included training and test data for seven relation types, such as *instrument-agency*, *product-producer*, etc., each of which was to be tackled as a separate binary classification task.

The sentences for the dataset were collected with handcrafted pattern-based Google queries, e.g. "\* contains \*" for the *Part-Whole* relation. The SemEval dataset contains 140 training and about 70 testing sentences for each of the seven given relation types, with about 50% positive and 50% (near miss) negative sentence classes. The sentences are tagged with the respective nominals and the relation between those nominals. Additionally, the dataset provides the WordNet sense keys for the nominals, as well as the Google query that was used to collect training and target data. The SemEval competition for this task was subdivided into four categories, depending on whether or not the participants used the WordNet sense keys and the Google query which led to the dataset as additional features in their systems. The four resulting categories were *A* (without WordNet, without Google query), *B* (with WordNet, without Google query), *C* (without WordNet, with Google query), and *D* (with WordNet, with Google query). For a more detailed description of the task and the datasets see Girju et al. [68].

The category overview in Girju et al. [68] shows that most teams participated in categories A and B, so they did not use the Google query. Many participants applied linguistic features such as syntactic dependencies, lexico-syntactic patterns and grammatical relations, and included features extracted from WordNet (linguistic features, similarity measures, etc.). Many participants applied kernel methods (with SVM learning algorithms) for the classification tasks, but some also used other classifiers like decision trees or naive Bayes.

The following listing briefly introduces some of the participating (and well-performing) systems, without the claim that those systems are superior in terms of results to others not considered here. The aim is to exemplify some of the methods used in that challenge.


#### **4.3.7 Learning of Qualia Structures**

Some researchers have been working on the automatic acquisition of *qualia structures* over the last two decades. Qualia structures are relevant for ontology learning as they describe a fixed set of relations that every object possesses [37]. Qualia structures originate from the work of Pustejovsky [136] and his *Generative Lexicon* framework, where Pustejovsky reused Aristotle's basic factors which describe the nature of an object: the *material cause* (the material an object is made of), the *agentive cause* (the source of movement, creation or change), the *formal cause* (the form or type of an object) and the *final cause* (the purpose or aim) [37]. Pustejovsky transforms Aristotle's basic factors into the four qualia roles. The *constitutive role* describes the physical properties of an object. The *agentive role* describes factors that bring an object into existence. The *formal role* includes properties that distinguish an object from others, and finally the *telic role* refers to the purpose or function of an object. Cimiano [37] gives an example relying on Johnston and Busa [91]: The qualia structure of the object *knife* could be specified as follows: constitutive (blade, handle, etc.), formal (artifact\_tool), telic (cut\_act), agentive (make\_act).

The identification of the qualia structure of an object uncovers important ontological properties about this object. Some of the qualia relations have been studied by the artificial intelligence community, especially *part-whole* and *subClassOf*. Cimiano and Wenderoth [40] present a method to automatically learn qualia structures for arbitrary nominals with evidence collected from the Web, thereby facilitating large-scale qualia assessment. The approach relies on lexico-syntactic patterns which convey certain semantic relations; those patterns are matched on the Web via search engines. Evaluations show that the results of the method are reasonable. Such a system can help lexicographers aiming at constructing lexicons, or NLP applications that incorporate deep lexical knowledge. Cimiano and Wenderoth extend their original approach by ranking extracted qualia elements for each qualia role with various methods [41]. The ranking, combined with a cutoff point, yields a reliability indicator for humans or systems inspecting the qualia structures. Among the evaluated measures, plain conditional probability and Web-based conditional probability gave the most promising results.

Other work on qualia structures includes the learning of telic and agentive relations by Yamada and Baldwin [197] (see above), related work by Poesio and Almuhareb [131] and Poesio et al. [132], or by Pustejovsky [137], who presents a framework for the acquisition of semantic relations from corpora based on the Generative Lexicon theory. This framework uses statistical techniques such as collocation analysis with linguistic phenomena like metonymy or polysemy in the process of knowledge acquisition. Claveau et al. [44] present a supervised method that decides whether a given verb is a qualia element or not. The method relies on Inductive Logic Programming and uses features such as part-of-speech, semantic tags for words, or the relative position of words in order to derive rules to predict if there is a qualia relation between a noun and a verb.

The discussion of methods to extract qualia structures concludes the overview of related work in the field of ontology learning. The upcoming section on the webLyzard ontology learning framework (Section 4.4) describes an architecture capable of ontology learning tasks such as terminology extraction and concept formation, and the learning of taxonomic and unlabeled relations between concepts.

## **4.4 webLyzard Ontology Learning System**

Two factors motivate this section about the webLyzard Ontology Extension (wL-OE) architecture: It serves as a showcase for some of the methods and ideas presented in the previous sections, introduces new methods, and applies these methods to semi-automatically learning ontologies. The novel techniques described and evaluated in the upcoming sections are strongly related to the wL-OE framework, as they introduce a component for the previously missing detection of non-taxonomic relations.

#### **4.4.1 System Overview**

Figure 4.5 shows the wL-OE architecture and a graphical overview of the interaction between its major components. The starting point and input to the ontology extension process is a *seed ontology.* The seed ontology is typically a small ontology including a number of important domain concepts either manually compiled by a domain expert, or an already extended ontology from a previous ontology extension iteration. Figure 4.6 gives an example of such a seed ontology.

The seed ontology is fed into the lexical analyzer, which distributes it to various *evidence sources* in order to find promising new concepts related to the concepts from the seed. Three methods are combined: *co-occurrence analysis* [147] to extract terms related to the seed concepts, *trigger phrases* [92] that indicate certain relations between terms (e.g. hyponymy), and *WordNet* [56] to provide hypernyms, hyponyms and synonyms. The new terms are then connected to the seed ontology to form a *semantic network* via labeled links.

*Figure 4.5:* Overview of the webLyzard ontology extension architecture [105]

After transforming the semantic network to a spreading activation network, the next step detects the *most important concepts* with a spreading activation algorithm, resulting in candidate concepts for an extended ontology. In the following *concept positioning* step, various methods such as head noun analysis, WordNet and additional rounds of spreading activation serve to determine the appropriate position for new concepts in the ontology, and also help to establish possible taxonomic relations between concepts. The original version of wL-OE has no component for the detection of non-taxonomic relations.

All corpus-based methods in this architecture typically build on domain corpora collected and annotated with the webLyzard suite of Web mining tools.<sup>13</sup> This platform includes crawling agents, which incrementally mirror numerous Web sites in regular intervals (e.g. weekly or monthly), organized by a set of samples. Those samples comprise the Fortune 1000 companies, over 150 international news media sites, and many others. Since 1999 the platform has amassed several terabytes of Web data. The collected HTML content, as well as other document types such as PDF and Word documents, are converted to plain text and processed with natural language processing techniques, which include segmentation (sentence splitting), part-of-speech tagging, language detection, lemmatization, etc. Additional components provide sentiment detection on document and sentence level, and methods for domain detection, e.g. the assessment that a particular document from a news media site is assigned to the climate change domain. The domain detection algorithm matches a list of domain keywords (represented by regular expressions) against the document, and computes a domain specificity value which takes into account various features of the matching process. The platform also provides the functionality to build domain corpora with arbitrary restrictions on document properties, such as affiliation with a sample, geographic origin, the period of time when the document was mirrored, document language, and also restrictions on the corpus size or output format (annotated XML or plain text). Scharl et al. [159] present a more detailed description of the webLyzard platform including features not mentioned here.

**<sup>13</sup>http://www.weblyzard.com**

*Figure 4.6:* Simple seed ontology for the domain of *energy sources*

#### **4.4.2 Major Components of the Framework**

The following paragraphs give a more detailed description of the components of the wL-OE framework. The main architectural constituents are (i) the collection of new concept evidence, which results in a semantic network, (ii) the identification of the most relevant new concepts, and (iii) the positioning of those concepts in the extended ontology and the identification of taxonomic relations.

#### **Collection of New Domain Concepts to Build a Semantic Network**

The current architecture uses three modules to gather evidence on new terms, each with different functions:


Directed labeled links connect all new terms found by the evidence sources to their respective seed concepts. This creates a semantic network, which is then transformed to a spreading activation network. The link weights are computed by functions depending on the type of evidence source and features such as significance values or term frequencies. New terms extracted with trigger phrases typically gain a low weight, as such phrases are rather unreliable. WordNet also receives a low value, as domain terminology is preferred. The weights for co-occurring terms mainly depend on their significance values.

#### **4.4.3 Identification of the Most Relevant Concepts**

Replacing the link labels with weights in the previous step creates a spreading activation (SA) network. Spreading activation is a search technique inspired by cognitive models of the human brain, where neurons fire activations to adjacent neurons [105]; for more information on artificial neural networks see Section 4.2.4. The SA network acts as glue to combine the terms extracted from evidence sources. During SA processing, sets of pulses are sent through the network in multiple iterations, each time also checking for termination conditions. Terms that acquire high activation levels in the SA process are selected as candidate terms, which are suggested to domain experts for inclusion in the extended ontology.
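A minimal Python sketch may help to picture how pulses propagate through such a network. It is a generic spreading activation loop, not the algorithm of Liu et al. [105]: the decay factor, the minimum pulse strength, the iteration limit, and the example weights are all assumptions chosen for illustration.

```python
def spread_activation(edges, seeds, decay=0.5, max_iterations=10, epsilon=1e-3):
    """Propagate activation from seed nodes along weighted, directed links.

    edges: {source: [(target, weight), ...]}
    seeds: iterable of seed nodes, each starting with activation 1.0
    """
    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)  # activation to be propagated in the next pulse
    for _ in range(max_iterations):
        next_frontier = {}
        for node, energy in frontier.items():
            for target, weight in edges.get(node, []):
                pulse = energy * weight * decay
                if pulse < epsilon:  # termination condition: pulse too weak
                    continue
                next_frontier[target] = next_frontier.get(target, 0.0) + pulse
        if not next_frontier:  # termination condition: nothing left to send
            break
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


# Hypothetical link weights between a seed concept and new terms.
edges = {
    "energy": [("solar power", 0.8), ("oil", 0.4)],
    "solar power": [("photovoltaics", 0.9)],
}
activation = spread_activation(edges, seeds=["energy"])
candidates = sorted((n for n in activation if n != "energy"),
                    key=activation.get, reverse=True)
```

In this toy network *solar power* receives the highest activation among the new terms, followed by *oil* and *photovoltaics*; the top-ranked terms would be the candidates presented to a domain expert.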

#### **4.4.4 Concept Positioning and Taxonomy Discovery**

Positioning the new concepts in the extended ontology is the most challenging task. Liu et al. [105] propose a four-step process: (i) accept semantic relations (hypernymy etc.) which can be confirmed with WordNet or by head noun analysis; (ii) identify modifiers of noun phrases which also appear on the list of activated concepts; (iii) initiate another round of spreading activation, where non-confirmed terms serve as seed terms, in order to detect appropriate nodes to connect these concepts to, and then apply subsumption analysis to determine possible taxonomic relations; (iv) consult domain experts for support in the concept positioning task.

Subsumption analysis [157] helps to automatically generate taxonomies based on the assumption that for co-occurring terms the more general term (hypernym) should appear more frequently than the specific term [105]. For two terms *x* and *y*, *x* subsumes *y* if $P(x|y) \geq 0.8$ and $P(y|x) < 1$.
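The subsumption test (*x* subsumes *y* if P(x|y) is at least 0.8 and P(y|x) is below 1) can be sketched in Python as follows. The sketch assumes that the conditional probabilities are estimated from document co-occurrence counts and that each document is given as a set of terms; the toy documents are invented for illustration.

```python
def subsumes(x, y, documents, threshold=0.8):
    """Return True if term x subsumes term y.

    P(x|y) and P(y|x) are estimated from the number of documents in which
    the terms occur together, relative to the documents containing y or x.
    """
    docs_x = {i for i, doc in enumerate(documents) if x in doc}
    docs_y = {i for i, doc in enumerate(documents) if y in doc}
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    p_x_given_y = both / len(docs_y)
    p_y_given_x = both / len(docs_x)
    return p_x_given_y >= threshold and p_y_given_x < 1.0


# Invented toy documents, each represented as a set of terms.
documents = [
    {"energy", "solar energy"},
    {"energy", "solar energy"},
    {"energy", "wind"},
    {"energy"},
]
general_over_specific = subsumes("energy", "solar energy", documents)
specific_over_general = subsumes("solar energy", "energy", documents)
```

Here *energy* subsumes *solar energy* because every document mentioning *solar energy* also mentions *energy* (P(x|y) = 1.0), while only half of the *energy* documents mention *solar energy* (P(y|x) = 0.5); the reverse test fails.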

Figure 4.7 gives an example of an extended ontology created by wL-OE. The directed solid lines in the figure indicate taxonomic relations, dashed lines labeled *m* denote modifiers, and *r* marks unnamed relations between two concepts. The concepts in ontologies learned by wL-OE are terms rather than concepts with rich semantic content. As mentioned, the system generates conceptualizations including taxonomic relations, as well as additional unlabeled non-taxonomic relations. The present thesis addresses the task of labeling the non-taxonomic relations with a novel approach presented in Chapter 4.

*Figure 4.7:* The ontology after two rounds of spreading activation

Building on the foundations laid in the previous sections, i.e. methods for learning semantic associations, related work, and the webLyzard ontology extension architecture, Section 4.5 introduces and formally discusses novel methods for learning non-taxonomic relations which combine corpus-based techniques with knowledge inferred from data available on the Semantic Web.

## **4.5 A Novel Method for Detecting Non-taxonomic Relations: Conceptual and Formal Description**

The presented supervised approach for labeling non-taxonomic relations relies on the combination of two ingredients: vector space similarity computed for verbs co-occurring with input relations, and background knowledge about concepts involved in relations which is retrieved from online semantic data sources. The method evolved from using vector space models only [191], to the addition of a rather simplistic way of concept type detection by querying DBpedia [194], and finally to the integration of an extended mechanism for concept grounding and type detection [190].

Figure 4.8 gives an overview of the relation labeling system. The input to relation labeling consists of (i) an XML/RDF representation of the OWL domain ontology containing labeled (optional) and unlabeled relations ($R\_{m'n'}$), (ii) the classification meta ontology, which includes the classification concepts and the relation labels as well as definitions of the relations' domain, range, and property restrictions, (iii) a natural language domain corpus, (iv) optionally, additional training relation specifications to complement the named relations defined in the domain ontology, and (v) structured information collected on the fly from online sources.

*Figure 4.8:* Overview of the relation labeling architecture [190]

Based on relations from the domain ontology, the framework collects verbs from domain documents which co-occur with the concepts $(C\_m, C\_n)$ participating in the relation $R\_{mn}$. Regular expressions $C\_m^r$ and $C\_n^r$ represent the respective concepts. After verb normalization (lemmatization), the system builds verb vectors from the most significant verbs per relation, according to the *tf-idf* measure. A VSM yields similarity scores between training relations and unlabeled relations $R\_{m'n'}$. Finally, the semantic validation and inference process refines those similarity scores, leveraging information from external sources. The refined similarity scores are transformed to labeling suggestions for unlabeled relations.

In accordance with the development history of the method and to increase clarity this section distinguishes the vector space model based component from the improvements yielded by concept type detection and ontological constraints and definitions. Section 4.5.1 elaborates the details of relation type suggestion based solely on corpus analysis, Section 4.5.2 then presents the classification ontology and components for grounding domain concepts, and finally Section 4.5.4 describes the integration of the VSM-based approach with information inferred from structured sources to refine the relation labeling results.

#### **4.5.1 Relation Labeling Based on Vector Space Similarity**

Formally introduced in Section 4.2.3, the Vector Space Model (VSM) is a common information retrieval method used for tasks such as computing the similarity between a query and a set of documents in document retrieval, or to calculate similarities between documents themselves. The documents or queries have to be transformed into a vector representation, typically by some segmentation algorithm that generates a list of terms. The method then associates those terms with a term weight, which is, for example, simply the term frequency in the document. Terms combined with term weights constitute a vector, and similarity measures such as the cosine yield similarity scores between two vectors.
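As a brief illustration of the VSM idea, the following Python sketch builds term-frequency vectors and ranks documents against a query by cosine similarity. The whitespace tokenization and the two example documents are simplifying assumptions made for this sketch.

```python
import math
from collections import Counter


def to_vector(text):
    """Segment a text on whitespace and weight each term by its frequency."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as term->weight maps."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example documents.
documents = {
    "d1": "solar energy reduces co2 emissions",
    "d2": "the orchestra played a new symphony",
}
query = to_vector("co2 emissions")
scores = {name: cosine(query, to_vector(text)) for name, text in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The query shares two terms with `d1` and none with `d2`, so `d1` is ranked first; the same `cosine` function works for any pair of sparse weighted vectors.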

The present work transfers the idea of similarity assessment with VSMs to the learning of non-taxonomic relations. Previous approaches extract co-occurring verbs as relation labels directly [95]. But as briefly mentioned in Section 4.3.1, it is a problem to directly map co-occurrences (e.g. co-occurring verbs) to "deep" ontological relations, as those verbs often also occur in a large semantic context. The method presented here tackles this issue by not using the verbs as labels directly, but adding the co-occurring verbs as features into a VSM, aiming at the detection of more general and already axiomatized relation types.

The relation labeling method starts with collecting verbs co-occurring with the predefined relation types from domain text, and then builds verb centroids. The actual provision of label suggestions for unnamed relations is based upon a comparison of centroids for the unnamed relations with centroids from training relations. The following sections formally describe the training process and the computation of label suggestions for unnamed relations.

#### **Training Process**

The present description of the training process and the terminology used is adopted and extended from the work presented in Weichselbraun et al. [191]; the upper part of Figure 4.8 illustrates the order of the main tasks in the training procedure. The first step in training the relation detection component is the acquisition of a number of training examples for each relation type, typically extracted from existing ontologies or handcrafted by domain experts. Future research will incorporate bootstrapping methods to reduce the human effort involved. Each training example contains two related concepts $(C\_m, C\_n)$ and links $R\_{mn}(C\_m, C\_n)$ between them. Every concept $C$ is connected to $C^r$, which is a list of Perl-style regular expressions used to detect the concept in natural language text. The algorithm applies the regular expressions to domain-specific corpora, extracting sentences $s\_i$ that "contain" relations $R\_{mn}$ from the training examples. Part-of-speech tags help to collect verbs occurring in those sentences. Equation 4.23 specifies the procedure more formally:

$$L\_{mn} = \{\, verbs(s\_i) \mid match(C\_m^r, s\_i) \land match(C\_n^r, s\_i) \land idx(C\_m^r, s\_i) < idx(C\_n^r, s\_i) \,\} \tag{4.23}$$

The Boolean function $match(C^r, s\_i)$ takes a list of regular expressions $C^r$ and a sentence $s\_i$ as input, and returns *true* if at least one of the regular expressions matches. For a sentence to be considered, *both* concepts of a particular relation have to be detected in the sentence. Furthermore, the order of occurrence of the concepts is important: the function $idx(C^r, s\_i)$ yields the location of the match in the sentence, and the comparison ensures that concept $C\_m$ occurs prior to the second concept $C\_n$. As the direction of a relation is important, the component always adds relations with inverted direction to the training examples in order to detect and use the original and the inverted relation. We define those relations as $R\_{mn}(C\_m, C\_n) := R\_{nm}(C\_n, C\_m)^{-1}$. Finally, the verbs from the sentences found per training relation are extracted with the function $verbs(s\_i)$. $L\_{mn}$ refers to the list of verbs compiled by the *verbs* function, which characterizes the semantic relation between $C\_m$ and $C\_n$.
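The functions *match*, *idx* and *verbs* can be sketched in Python as follows. The sketch rests on several assumptions: the regular expressions are hypothetical examples, Python's `re` module stands in for Perl-style expressions, sentences arrive already tokenized and tagged with Penn Treebank tags, and lemmatization is omitted.

```python
import re


def match(patterns, sentence):
    """True if at least one regular expression matches the sentence."""
    return any(re.search(p, sentence) for p in patterns)


def idx(patterns, sentence):
    """Character offset of the earliest match of any pattern, or None."""
    starts = [m.start() for p in patterns if (m := re.search(p, sentence))]
    return min(starts) if starts else None


def verbs(tagged_sentence):
    """All tokens whose part-of-speech tag marks a verb (Penn tags VB*)."""
    return [token for token, tag in tagged_sentence if tag.startswith("VB")]


def collect_verbs(c_m, c_n, tagged_sentences):
    """Build L_mn from the sentences in which C_m occurs before C_n."""
    l_mn = []
    for tagged in tagged_sentences:
        sentence = " ".join(token for token, _ in tagged)
        if (match(c_m, sentence) and match(c_n, sentence)
                and idx(c_m, sentence) < idx(c_n, sentence)):
            l_mn.extend(verbs(tagged))
    return l_mn


# Hypothetical patterns and a hand-tagged sentence for illustration.
CO2 = [r"\bco2\b", r"carbon dioxide"]
CLIMATE_CHANGE = [r"climate change", r"global warming"]
S1 = [("reducing", "VBG"), ("co2", "NN"), ("protects", "VBZ"), ("us", "PRP"),
      ("from", "IN"), ("the", "DT"), ("threat", "NN"), ("of", "IN"),
      ("climate", "NN"), ("change", "NN")]

l_mn = collect_verbs(CO2, CLIMATE_CHANGE, [S1])
l_nm = collect_verbs(CLIMATE_CHANGE, CO2, [S1])
```

Because *co2* precedes *climate change* in the sentence, verbs are collected only for the relation in that direction; the inverted relation receives no verbs from this sentence.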

Various modifications of the $verbs(s\_i)$ operator are of interest for optimizing the method and its evaluation. Those variations can be interpreted as generating alternative knowledge bases (*KB*, *KB'*, *KB''*, etc.), as they alter the verbs in the verb vectors and the resulting relation centroids:


The initial version of the VSM [191, 194] simply used the frequency of verbs co-occurring with the particular relation as features for building the vectors. Observations on the data set and similarity scores between relations revealed that this favors relations with bigger vectors, i.e. relations that often appear in the corpus text, as such relations include a large subset of verbs occurring in common English language. In order to tackle this problem, the current implementation computes the most relevant verbs for each relation with a *tf-idf* measure, and only selects a fixed maximum number of verbs (for example the 150 most significant verbs) for inclusion into the verb vector. We utilized a common variant of *tf-idf* (see Section 4.2.3) which normalizes the term frequency with the document size, i.e. the total number of terms in the document. The following Equations 4.24-4.26 define the *tf-idf* measure used:

$$tf\_{i,j} = \frac{n\_{i,j}}{\sum\_{k} n\_{k,j}}\tag{4.24}$$

$$idf\_i = \log \frac{|D|}{df\_i}\tag{4.25}$$

$$tf \cdot idf\_{i,j} = tf\_{i,j} \cdot idf\_i \tag{4.26}$$

Applying these equations to the situation at hand, $tf\_{i,j}$ for verb $i$ is computed as the number of times $n\_{i,j}$ the verb occurs with a particular relation $j$, normalized by the *size* of the relation, i.e. the number of all verbs $\sum\_{k} n\_{k,j}$ occurring with that relation. The logarithmic function $\log$ applied to the total number of relations $|D|$ divided by the number of distinct relations $df\_i$ that a verb $i$ occurs with yields $idf\_i$. The first term in *tf-idf* favors verbs that are more frequent with specific relations, the second one verbs that appear with few relations.
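Equations 4.24-4.26, applied to verbs and relations, can be sketched in Python as below. The verb lists are invented, the natural logarithm is an assumption (the text does not fix the base), and the helper `top_verbs` merely illustrates the selection of a fixed maximum number of verbs.

```python
import math
from collections import Counter


def tf_idf(verbs_per_relation):
    """Map {relation: [verb, ...]} to {relation: {verb: tf-idf weight}}."""
    n_relations = len(verbs_per_relation)                      # |D|
    counts = {rel: Counter(vs) for rel, vs in verbs_per_relation.items()}
    # df_i: number of distinct relations a verb occurs with
    df = Counter(verb for c in counts.values() for verb in c)
    weights = {}
    for rel, c in counts.items():
        total = sum(c.values())                                # sum_k n_kj
        weights[rel] = {verb: (n / total) * math.log(n_relations / df[verb])
                        for verb, n in c.items()}
    return weights


def top_verbs(vector, k=150):
    """Keep only the k most significant verbs of a verb vector."""
    best = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(best)


# Invented verb lists for two relations.
verbs_per_relation = {
    "r1": ["cause", "cause", "be"],
    "r2": ["be", "locate"],
}
weights = tf_idf(verbs_per_relation)
r1_top = top_verbs(weights["r1"], k=1)
```

The verb *be* occurs with both relations, so its *idf* is log(2/2) = 0 and it contributes nothing, whereas *cause* is specific to `r1` and receives the weight (2/3) · log 2.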

**Example.** The following example illustrates this process. Given snippets from a domain corpus and a set of training relations, the system extracts sentences and verbs. Table 4.3 contains a few training relations, together with a numeric identifier used to refer to them subsequently. Table 4.4 gives the regular expressions for all concepts in the example relations; those regular expressions are matched against the domain corpus snippet, yielding a set of sentences associated with each relation. Table 4.5 contains the results of the *verbs* function on those sentences; it presents three variants regarding word window size.


*Table 4.3:* Examples of training relations

#### **Corpus:**

[ ... ] (s1) **reducing** *co2* protects us from the threat of *climate change.* (s2) **sorting** out our energy generation problem will **do** two things - it will **halt** the dumping of *co2* into the environment, which will **appease** those who **believe** this *co2* **is causing** *climate change.* (s3) the study, **paid for** by the united states *national oceanic and atmospheric administration,* **describes** the marshall islands as one of the "innocent victims" of *global warming.* (s4) the new study comes from researchers at the georgia institute of technology in atlanta, us. (s5) jerry mahlman, who **used** to **be** noaa's top climate model expert, **said** that a decade ago then-vice president al gore asked if global warming could **cause** more tornadoes. (s6) researchers working with toyota at berkeley will concentrate on consumer behavior, sounding out their view of plug-in hybrids before and after driving them. [ ... ]

*Table 4.4:* Concepts occurring in example relations including associated regular expressions


*Table 4.5:* Sentences found per relation including extraction variants of lemmatized verbs


<sup>14</sup>Extract *all* verbs from the respective sentence.

<sup>15</sup>Extract verbs within a *sliding window of 7 words.*

<sup>16</sup>Extract verbs within a *sliding window of 5 words.*

#### **Generation of Centroids**

The extraction of the verbs per sentence for each training relation is followed by the final task in the training process, the computation of verb centroids. Equation 4.27 calculates the centroid *V<sub>mn</sub>* from the list of verbs *L<sub>mn</sub>*. *V<sub>mn</sub>* is the verb vector for the relation *R<sub>mn</sub>* between the two concepts *C<sub>m</sub>*, *C<sub>n</sub>*. The operator *vsm<sub>n</sub>* yields the *n* verbs with the highest *tf-idf* significance transformed into a vector space representation. In the evaluation we experimented with *n* = 20 (include only the 20 most significant verbs in the vector) and with *n* = 150.

$$
\vec{V}\_{mn} = vsm\_n(L\_{mn}) \tag{4.27}
$$

In addition to the verb centroids, the knowledge base (*KB*) stores mappings from concept pairs (*C<sub>m</sub>*, *C<sub>n</sub>*) to their relation label *j* in the form of a function *M<sub>mn→j</sub>*. The mapping function thus connects concept pairs to relation labels *j*.
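The selection step behind Equation 4.27 can be illustrated as follows. This is a sketch under assumptions: the helper name `top_n_vector`, the sparse dictionary representation and the scores are hypothetical, and ties are broken alphabetically only to keep the result deterministic.

```python
def top_n_vector(tfidf_scores, n=150):
    """Keep the n verbs with the highest tf-idf score as a sparse vector.

    tfidf_scores maps verb -> tf-idf significance for one relation.
    Ties are broken alphabetically (an arbitrary choice for this sketch).
    """
    ranked = sorted(tfidf_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])

# Hypothetical significance scores for one relation
relation_scores = {"reduce": 0.077, "protect": 0.077, "cause": 0.0, "halt": 0.05}
centroid = top_n_vector(relation_scores, n=2)
```

A sparse dictionary keeps the verbs as keys for readability; a dense implementation would store only the significance numbers at fixed vector positions.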

For the examples given above, the centroids for the relations with ID 1 and 2, when extracting all verbs from the sentence, are<sup>17</sup>:


The values in the two vectors represent the *tf-idf* scores computed for this very simple case of only two training relations. The constituent for the verb *reduce,* for example, follows from a term frequency *n<sub>i,j</sub>* of 1, a "relation size" *Σ<sub>k</sub> n<sub>k,j</sub>* of 9, a total number of relations *|D|* of 2 and a number of distinct relations *df<sub>i</sub>* of 1 where the verb occurs. The verb *cause* appears with all relations, leading to an *idf<sub>i</sub>* of 0.

<sup>17</sup>Note that the verbs themselves are not included in the real vectors, only the respective significance numbers.

#### **Computation of Relation Suggestions for Unnamed Relations**

After completing the training process the system is ready to calculate label suggestions for testing relations. Those testing relations are relations *R<sub>m\*n\*</sub>* from a domain ontology between two concepts *C<sub>m\*</sub>*, *C<sub>n\*</sub>*, each of which is associated with a list of regular expressions *er*. For every testing relation the same procedure as for training relations is applied, resulting in verb centroids: the system locates the testing relation in the corpus, collects verbs from matching sentences, lemmatizes them, and finally sets up the verb centroid. The system compares the verb centroid to all training relations in terms of vector space similarity (with the cosine similarity measure). After ordering the compared training relations by similarity, the relation labels of the training relations serve as label suggestions for the testing relation. A more formal description of the label suggestion procedure follows:


Table 4.6 exemplifies the process of calculating similarity values between a testing relation *R<sub>m\*n\*</sub>* and training relations *R<sub>mn</sub>* from the knowledge base. The testing relation *scientist* ↔ *greenhouse effect*, more precisely the verb centroid *V<sub>m\*n\*</sub>* associated with it, is compared to training relations, resulting in similarity scores *sim*. Table 4.7 presents the similarity-ordered list of relation label suggestions for the testing relation, with no aggregation yet applied, resulting from the similarity scores given in Table 4.6. In this example the training relation between *NOAA* and *global warming* is most similar, with a similarity *sim(V<sub>m\*n\*</sub>, V<sub>mn</sub>)* = 0.35, making (*j* = *study*) the best label candidate for the testing relation.
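The comparison and ranking step can be sketched as below. The function names `cosine` and `suggest_labels`, the sparse dictionary vectors and all numeric values are assumptions made for illustration; they are not the values behind Tables 4.6 and 4.7.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {verb: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

def suggest_labels(test_vector, training):
    """Rank training relations (label, vector) by similarity to test_vector."""
    scored = [(label, cosine(test_vector, vec)) for label, vec in training]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical centroids
test_vector = {"study": 0.4, "measure": 0.2}
training = [
    ("cause", {"cause": 0.6, "increase": 0.3}),
    ("study", {"study": 0.5, "describe": 0.1}),
]
ranking = suggest_labels(test_vector, training)
```

The label of the top-ranked training relation is the first suggestion; no aggregation of repeated labels is applied in this sketch.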


*Table 4.6:* Calculate the similarity *sim* between an unnamed relation (*R<sub>m\*n\*</sub>*) and six training relations *R<sub>mn</sub>*


*Table 4.7:* Order the list of relation label suggestions *j* for *scientist* ↔ *greenhouse effect* by their similarity score *sim*

#### **4.5.2 Ontological Restrictions and Integration of External Knowledge**

The evaluation of the vector space model based relation label suggestion component showed a significant improvement in performance versus a random selection of relation labels. Nevertheless, there is still room for improvement (see Chapter 5). Integrating information from Semantic Web data sources can increase the performance of the method [194]. If the proposed framework had additional semantic information about the concepts in the relations, i.e. if it had information about their *types* according to a meta ontology, then it could use this type information together with ontological restrictions to reject some of the relation label candidates, which would improve the performance of the method. The following overview summarizes the main steps involved in this procedure:


<sup>18</sup>In another domain, for example in psychological research, those restrictions would be different. In psychology the object (range) of *study* is typically a *Person.*

*Figure 4.9:* Considering external knowledge when selecting relation labels by grounding *NOAA* and *climate change* and the application of ontological restrictions [194]

Figure 4.10 gives an overview of the processes included in the refinement method. The *semantic validation and inference component* obtains the concepts *C<sub>m\*</sub>* and *C<sub>n\*</sub>* participating in the unlabeled relation and also a list of relation labels including significance scores as suggested by the vector space method described above.

With the help of information from external sources like DBpedia and OpenCyc, and supported by a reasoner, the concepts *C<sub>m\*</sub>*, *C<sub>n\*</sub>* are grounded in the classification meta ontology. With this information the system refines the weights for any of the suggested relation labels (from the VSM) by verifying domain, range and property constraints, which finally leads to a refined ranking of label suggestions for the unlabeled relation *R<sub>m\*n\*</sub>*.

#### **The Classification Ontology and Ontological Restrictions**

As the concepts *C<sub>m\*</sub>*, *C<sub>n\*</sub>* stemming from the webLyzard ontology learning system are currently terms with few further semantic annotations, it is crucial to detect additional concept *type* information in order to apply the methods just presented. The *classification meta ontology* defines the concept (meta) types for this process. Figure 4.11 presents an example of such a classification ontology (left side), and also visualizes the properties for the classification meta ontology used in this thesis. The figure also presents the specification of the domain and range restrictions for the predicate *study* defined upon the meta concepts.

The classification ontology reflects the domain and range restrictions of predicates used as possible relation labels, and the local (property) restrictions (see Section 3.2.3 for more information about ontological restrictions in the context of OWL).

*Figure 4.10:* The process of refinement of VSM similarities based on information from external knowledge [190]

The following OWL snippet presents a fragment of the classification meta ontology, which defines domain and range for the property *study.* This definition corresponds to the visualization of the characteristics of the *study* property in Figure 4.11. Using OWL-Lite terminology we describe the *domain* of the property *study* as the union of the meta concepts *cl:Person* and *cl:Organization* from the classification ontology. The classification type *cl:Unknown* needs to be added to all restrictions implicitly for the case where no grounding was possible. The range of *study* is limited to *cl:ObjectTopic* and *cl:AbstractTopic.* The QName *cl:* refers to the classification meta ontology throughout the current chapter.

```
<owl:ObjectProperty rdf:ID="study">
  <rdfs:domain>
    <owl:Class>
      <owl:unionOf rdf:parseType="Collection">
        <owl:Class rdf:about="cl:Person"/>
        <owl:Class rdf:about="cl:Organization"/>
        <owl:Class rdf:about="cl:Unknown"/>
      </owl:unionOf>
    </owl:Class>
  </rdfs:domain>
  <rdfs:range>
    <owl:Class>
      <owl:unionOf rdf:parseType="Collection">
        <owl:Class rdf:about="cl:ObjectTopic"/>
        <owl:Class rdf:about="cl:AbstractTopic"/>
        <owl:Class rdf:about="cl:Unknown"/>
      </owl:unionOf>
    </owl:Class>
  </rdfs:range>
</owl:ObjectProperty>
```
*Figure 4.11:* Classification meta ontology including an example specification of domain and range restrictions for the *study* relation [190]

For all other properties used as relation labels in the domain ontology a similar process of adding domain and range restriction statements applies. Chapter 5 provides more information about restrictions used for other relations. Future work will attempt to automatically learn the restrictions, e.g. from training relations (see Chapter 6).

Local restrictions on concepts in the domain ontology also help to specify the correct usage of relation labels. The snippet below defines local restrictions on the meta concept *cl:Organization* regarding the *cl:subClassOf* property. It states that if the subject in a *cl:subClassOf* relation is of type *Organization,* then the object also has to be of type *Organization.* Local restrictions allow fine-grained specifications of the usage of domain relations and will facilitate more precise filtering and weighting in the subsequent link detection process.

```
<owl:Class rdf:ID="Organization">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#subClassOf"/>
      <owl:allValuesFrom rdf:resource="#Organization"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
```

#### **Concept Grounding**

For the application of the presented restrictions in the relation label suggestion process we need to map concepts from unnamed relations to concept (meta) types from the classification ontology. We applied two strategies for grounding concepts in the current work. The first prototype relies on queries against DBpedia pages for the respective resources [194]; a more sophisticated successor implementation involves a reasoning-based approach on ontologies linked by the DBpedia page.

**Queries against DBpedia.** A simple way to guess the type of a term representing a concept is to exploit the data about this term which resides directly in the corresponding DBpedia page (and, where available, the Freebase resource linked in that page). For example, if a resource has the property http://dbpedia.org/ontology/birthdate, then we can infer that it is an instance of *Person,* as this is stated by domain restrictions for that property in the DBpedia ontology. Similarly, if a resource has the property http://dbpedia.org/property/parentagency, then it is likely to be an *Organization.* If Freebase reveals that something is of type base.science, we assume the resource to be an *AbstractTopic.*
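The property-based guess can be sketched as a simple lookup. The sketch assumes that the properties of a DBpedia page (and any Freebase types) have already been retrieved into plain lists, so no network access takes place; the names `PROPERTY_HINTS`, `FREEBASE_TYPE_HINTS` and `guess_type` are hypothetical, and the tables contain only the three hints named in the text.

```python
# Property -> meta type hints, taken from the examples in the text.
PROPERTY_HINTS = {
    "http://dbpedia.org/ontology/birthdate": "Person",
    "http://dbpedia.org/property/parentagency": "Organization",
}
# Freebase type -> meta type hint, also from the text.
FREEBASE_TYPE_HINTS = {"base.science": "AbstractTopic"}

def guess_type(properties, freebase_types=()):
    """Return the first meta type suggested by a known property or type.

    Falls back to "Unknown" when no hint applies, mirroring the case
    where grounding is not possible.
    """
    for prop in properties:
        if prop in PROPERTY_HINTS:
            return PROPERTY_HINTS[prop]
    for fb_type in freebase_types:
        if fb_type in FREEBASE_TYPE_HINTS:
            return FREEBASE_TYPE_HINTS[fb_type]
    return "Unknown"

agency_type = guess_type(["http://dbpedia.org/property/parentagency"])
```

Checking DBpedia properties before Freebase types is an arbitrary ordering chosen for this sketch.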

The following listing sketches the procedure applied to detect type information for a concept with the *queries against DBpedia* method:


**Reasoning on External Ontologies.** This section describes a more sophisticated method to detect the *type* of concepts according to a classification ontology. The basic idea is to extract links to external ontologies (currently: OpenCyc, DBpedia ontology) from the DBpedia page, and then apply reasoning techniques on those ontologies to determine if the resource is a subclass of a predefined grounding meta concept.

Before classifying individual resources, we have to create the inferred models and define the classification ontology mappings:


```
<owl:Class rdf:ID="Organization">
  <owl:unionOf rdf:parseType="Collection">
    <owl:Class
        rdf:about="http://sw.opencyc.org/concept/Mx4r ..." />
    <owl:Class
        rdf:about="http://dbpedia.org/ontology/Organisation" />
  </owl:unionOf>
</owl:Class>
```
After the acquisition of the DBpedia page for a concept label with the procedure described in steps 1-5 for the method *queries against DBpedia,* the first step in the actual classification process is to collect links to external ontologies; for this purpose we exploit **owl:sameAs** and **rdf:type** properties occurring in the respective DBpedia page. The crucial element of the procedure is then to check via SPARQL queries against the inferred ontology models if the resource is a subclass of any mapping classes defined in the classification meta ontology. Figure 4.12 gives a visual impression of the results of this technique: The DBpedia resource http://dbpedia.org/resource/Scientist, which represents the domain concept "scientist", contains an owl:sameAs link into OpenCyc. It follows from the OpenCyc ontology that OpenCyc:Scientist is a subconcept of OpenCyc:Person. According to the definitions in the classification ontology, the system maps OpenCyc:Person to cl:Person. For demonstration purposes Figure 4.12 also includes a branch leading to no classification results.
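The subclass check can be illustrated without a triple store. The sketch below replaces the SPARQL queries on inferred models with a plain traversal of a small in-memory subclass graph, which is a simplification; all identifiers (`superclasses`, `ground`, the prefixed names such as `opencyc:Scientist`) are hypothetical stand-ins, and the direct edge from scientist to person is assumed for brevity.

```python
def superclasses(cls, subclass_of):
    """All transitive superclasses of cls in a {class: [parents]} graph."""
    seen, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def ground(resource, same_as, subclass_of, mappings):
    """Map a resource to a meta type via its links to external ontologies."""
    for linked in same_as.get(resource, []):
        candidates = {linked} | superclasses(linked, subclass_of)
        for meta_type, mapped_classes in mappings.items():
            if candidates & mapped_classes:
                return meta_type
    return "Unknown"  # no branch led to a mapped class

# Hypothetical miniature of the example in Figure 4.12
same_as = {"dbpedia:Scientist": ["opencyc:Scientist"]}
subclass_of = {"opencyc:Scientist": ["opencyc:Person"]}
mappings = {"cl:Person": {"opencyc:Person"}}
scientist_type = ground("dbpedia:Scientist", same_as, subclass_of, mappings)
```

A resource without usable links, or whose superclasses match no mapping class, ends up as `"Unknown"`, which corresponds to the branch without classification results.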

*Figure 4.12:* Reasoning example for the concept label *scientist* 

Figure 4.13 includes another example of concept type detection with the help of ontological reasoning. This time the concept grounding component finally maps the concept "NOAA" to Organization. The mapping process is a little more complicated: it involves the resolution of a DBpedia redirect and also shows a longer reasoning chain.

The result from concept grounding is a set of ontology fragments which ground the domain concepts in the classification meta ontology, as illustrated in the statements below:

*Figure 4.13:* Reasoning example for the concept label *NOAA* [190]

```
<!-- information derived from reasoning -->
<http://dbpedia.org/resource/NOAA>
  <rdf:type rdf:resource="&cl;Organization"/>
<http://dbpedia.org/resource/Scientist>
  <rdfs:subClassOf rdf:resource="&cl;Person"/>
```
#### **4.5.3 The Knowledge Base**

The knowledge base (KB) for the relation detection framework emerges from the mechanisms described in this section, i.e. it contains the data generated in the various steps. The final step, the calculation of relation suggestions, is executed upon this KB. The following constituents make up the KB:


$$KB = (\{\vec{V}\_{m\_1 n\_1}, \dots \vec{V}\_{m\_k n\_l}\}, M\_{mn \to j}, \mathcal{O}, \mathcal{O}\_d, \{\mathcal{O}\_1, \dots, \mathcal{O}\_n\}) \tag{4.28}$$

#### **4.5.4 A Hybrid Method for Relation Labeling**

The final step in the presented framework is to label relations based on the information compiled within the knowledge base. As mentioned, the refined labeling suggestions build on the similarities computed by the VSM, i.e. the similarities *sim(V<sub>m\*n\*</sub>, V<sub>mn</sub>)* between the unlabeled relations and all the centroids for training relations from the knowledge base. The system then combines these similarity scores with domain knowledge. Equation 4.29 outlines the process:

$$sim\_{mn} = w\_{o,m^\*n^\*} \underbrace{(M\_{mn\to j}(\mathcal{C}\_m, \mathcal{C}\_n))}\_{j} \cdot sim(\vec{V}\_{m^\*n^\*}, \vec{V}\_{mn}) \tag{4.29}$$

Multiplying the weighting factor *w<sub>o,m\*n\*</sub>* with the similarity score from the VSM results in an enhanced similarity value *sim<sub>mn</sub>* between an unlabeled relation *R<sub>m\*n\*</sub>* and a training relation *R<sub>mn</sub>*.

The weighting factor *w<sub>o,m\*n\*</sub>* applies to an unlabeled relation *R<sub>m\*n\*</sub>* and a particular relation label *j.* The weighting factor's purpose is to integrate domain knowledge; Equation 4.30 describes the heuristic used to compute it:

$$w\_{o,m^\*n^\*}(j) = \begin{cases} 1.0 & \text{if } \mathcal{O} \vdash \mathcal{C}\_{m^\*} \in dom(j) \quad \land \\ & \mathcal{O} \vdash \mathcal{C}\_{n^\*} \in range(j) \land \mathcal{O}(j(\mathcal{C}\_{m^\*}, \mathcal{C}\_{n^\*})) \\ 0.01 & \text{if } \mathcal{O} \vdash \mathcal{C}\_{m^\*} \notin dom(j) \quad \lor \\ & \mathcal{C}\_{n^\*} \notin range(j) \lor \neg \mathcal{O}(j(\mathcal{C}\_{m^\*}, \mathcal{C}\_{n^\*})) \\ 0.8 & \text{if } \mathcal{O} \vdash \mathcal{C}\_{m^\*} \in dom(j) \quad \lor \\ & \mathcal{C}\_{n^\*} \in range(j) \\ 0.6 & \text{otherwise} \end{cases} \tag{4.30}$$

Equation 4.30 yields the weight *w<sub>o,m\*n\*</sub>* by checking if the knowledge base (generated in previous steps) supports the domain and range restrictions, as well as local restrictions, for the label suggestion *j* in combination with the concepts *C<sub>m\*</sub>*, *C<sub>n\*</sub>*. We applied a set of fixed weights depending on the level of correspondence with the restrictions. Those weights were chosen in an intuitive and ad-hoc fashion and performed well in the experiments. We chose the weights independently of the evaluations, and they are therefore also applicable to other datasets and domains. Future research will optimize the weights generally and for specific applications.

If the concept grounding component successfully detects the type (according to the classification ontology) of both concepts involved in the unlabeled relation, and if these concepts fulfill all restrictions defined for a label *j* (i.e., the ontology snippets *O* imply (⊢) that domain, range and property restrictions are met), then a weighting factor of 1.0 results. This means that the subject satisfies the domain restrictions, the object satisfies the range restrictions and also property restrictions are fulfilled. If the system can only detect the concept type of one of the two concepts involved in the relation, and that concept fulfills the restrictions, then the system applies a weighting factor of 0.8 in Equation 4.29. In situations where the types of both concepts are unknown, we have no additional evidence on the correctness of a candidate label. If restrictions cannot be verified, a weighting factor of 0.6 results. Finally, if the concepts are in conflict with restrictions, the system yields a factor of 0.01 - which has the effect that the suggestion will be ranked at the very end, but the original order of suggestions from the VSM is not completely lost.
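The weighting heuristic can be sketched as a small function. This is one possible reading of Equation 4.30 and the paragraph above, not the thesis code: the name `weight`, the representation of types as strings, and the modelling of local restrictions as an optional callable are assumptions. `"Unknown"` is handled explicitly instead of being added to every domain and range set.

```python
UNKNOWN = "Unknown"

def weight(subject_type, object_type, domain, range_, local_ok=None):
    """Weighting factor in the spirit of Eq. 4.30 (illustrative sketch).

    domain / range_ are sets of admissible meta types for a label j.
    local_ok is an optional callable (subject_type, object_type) -> bool
    for local restrictions; it is only consulted when both types are known.
    """
    subject_known = subject_type != UNKNOWN
    object_known = object_type != UNKNOWN
    # A grounded concept that violates a restriction signals a conflict.
    if subject_known and subject_type not in domain:
        return 0.01
    if object_known and object_type not in range_:
        return 0.01
    if subject_known and object_known:
        if local_ok is not None and not local_ok(subject_type, object_type):
            return 0.01
        return 1.0   # both grounded, all restrictions met
    if subject_known or object_known:
        return 0.8   # one concept grounded and conforming
    return 0.6       # nothing could be verified

STUDY_DOMAIN = {"Person", "Organization"}
STUDY_RANGE = {"ObjectTopic", "AbstractTopic"}
w_study = weight("Person", "ObjectTopic", STUDY_DOMAIN, STUDY_RANGE)
```

For a label like *subClassOf*, passing `local_ok=lambda s, o: s == o` would express the restriction that subject and object share a classification type.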

Table 4.8 presents an example for the computation of *sim<sub>mn</sub>* as specified in Equation 4.29. The example compares the unlabeled relation *scientist* ↔ *greenhouse effect* to four training relations. The process starts with similarity scores from the VSM (*sim* := *sim(V<sub>m\*n\*</sub>, V<sub>mn</sub>)*) given in the column *sim.* The mapping function *M<sub>mn→j</sub>* simply yields the relation label *j* for concept pairs from training relations. The VSM similarity scores are adjusted by the weighting factor *w<sub>o,m\*n\*</sub>*, which combines external domain knowledge (concept grounding) with ontological restrictions, finally resulting in *sim<sub>mn</sub>*.

The example in Table 4.8 distinguishes four cases of success in concept grounding, which are reflected by four rows per training relation in the table. Either both concepts could be grounded, or just the subject concept or the object concept, and finally there is also a case where none of the concepts could be grounded. The symbol "-" indicates failure in the grounding process. The symbol "c" refers to concepts which conform to the restrictions, "v" marks violations of domain, range or property restrictions.


*Table 4.8:* Relation label suggestion for the relation *scientist (cl:Person)* ↔ *greenhouse effect (cl:ObjectTopic)*; the letters "c" and "v" indicate that information is "corresponding to" or "violating" ontological constraints. "-" implies that grounding was not successful for the concept

The VSM yields the similarity values *sim* of 0.33, 0.3, 0.31, and 0.29 between the unlabeled relation *scientist* ↔ *greenhouse effect* and the four training relations. The relation labels suggested by the training relations and determined with the mapping function are *subClassOf, study, studiedBy* and *study.*

In case of the first training relation, with both concepts from the unlabeled relation grounded successfully, we have a weighting factor of 0.01, as the *subClassOf* predicate includes local restrictions which basically state that the subject and object have to be of the same classification type. So the types *Person* and *ObjectTopic* conflict with this restriction. In the other three cases where one or both concept types are unknown we have no conflicts for *subClassOf* with ontological restrictions, but a certain level of uncertainty, which results in the factors 0.8 and 0.6 respectively.

The restrictions defined on the *study* relation are consistent with the grounding results; this leads to a factor of 1.0 when grounding was successful. The *studiedBy* relation, on the other hand, has a domain of {*Topic, Unknown*} and a range of {*Person, Organization, Unknown*} - so if any of the concepts can be grounded this induces a conflict.

A consolidated view on the example indicates that when relying on the VSM only, the relation label *subClassOf* possesses the highest similarity score to the unlabeled relation *scientist* ↔ *greenhouse effect.* But with the integration of domain knowledge in the form of concept grounding the method prefers the *study* relation - in the case where grounding was successful for both concepts from the unlabeled relation.
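The re-ranking described in this example can be reproduced with a few lines of Python. The similarity values and labels are those quoted in the text; the weights correspond to the case where both concepts were grounded (0.01 for the conflicting labels, 1.0 for *study*). The variable names are hypothetical.

```python
# (label, VSM similarity, weighting factor for the both-grounded case)
candidates = [
    ("subClassOf", 0.33, 0.01),
    ("study", 0.30, 1.0),
    ("studiedBy", 0.31, 0.01),
    ("study", 0.29, 1.0),
]

# Eq. 4.29: refined similarity = weighting factor * VSM similarity
refined = sorted(
    ((label, sim * w) for label, sim, w in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
refined_labels = [label for label, _ in refined]
```

After weighting, both *study* candidates move ahead of *subClassOf* and *studiedBy*, while the two down-weighted labels keep their relative VSM order at the end of the list.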

The last step, which finally results in an ordered list of relation label suggestions, starts with sorting the candidates (i.e. the list of training relations) for any unnamed relation by the similarity value *sim<sub>mn</sub>*, as computed in Equation 4.29. The algorithm then translates this candidate list into a list of relation labels *j* with the help of the mapping function *M<sub>mn→j</sub>* applied to the training relations *R<sub>mn</sub>*. As already mentioned in Section 4.5.1 we experimented with using the unmodified ordered list of suggestions, and also included aggregation mechanisms, so the method suggests the labels following one of three strategies:


#### **4.5.5 Integration of User Feedback**

The *integration of user feedback* extends the initial knowledge base built from training relations on-the-fly. The knowledge base (*KB*) includes all training relations, i.e. known relations with their types from the domain ontology, and additionally relations manually compiled by domain experts as training data. The optional *integration of user feedback* component adds testing relations to the *KB* as new training data after they have been validated by a domain expert. Domain experts either confirm or discard the addition of a new relation. If confirmed, the system adds the relation *R<sub>m′n′</sub>* and the centroid representing it, *V<sub>m′n′</sub>*, as well as its mapping *M<sub>m′n′→j</sub>*, to the *KB.* If the newly added relations are in conflict with the definitions from the classification meta ontology, then the architecture reports feedback to an ontology engineer who either updates the classification ontology or discards the new information. The availability of an increasing set of pre-learned centroids, and the updates on the classification ontology, help to constantly improve the performance of the method.

## **4.6 Implementation of the Method**

The method presented in Section 4.5 was implemented using the Python programming language<sup>19</sup>. We separated the relation suggestion architecture into a few packages which represent the major components of the system, i.e. packages containing the various modules for training the system, for computing similarities between vectors, for concept grounding and modules that integrate external sources. Finally, there are packages for generating suggestions and evaluating the approach. The components are complemented by modules that provide common tools shared among the packages, such as database access, configuration handling, handling of CSV data and many others.

The application makes heavy use of a database driven by the PostgreSQL<sup>20</sup> database management system. The architecture stores almost all data, e.g. corpus definitions, training and testing relations, sentences, verb vectors, grounding results and evaluations in that database. This helps to modularize the application and to serialize tasks. However, the domain corpora are not included in the database. In the case of large text corpora, a database system has little advantage over storage in the file system.

In Section 4.6.1 we shed light on the implementation of the training process, i.e. the extraction of verbs occurring with the relations' concepts, and then continue with the realization of the computation of vector space similarities in 4.6.2. Section 4.6.3 describes the modules for concept grounding, followed by a brief discussion of the code to access the Scarlet RelationFinder in 4.6.4. This section concludes with an overview of the evaluation package and related configuration settings in 4.6.5.

<sup>19</sup>http://www.python.org

<sup>20</sup>http://www.postgresql.org

*Figure 4.14:* Component diagram of metadata and relation specification

#### **4.6.1 Training**

The first step in applying the architecture is to specify high level metadata about knowledge domains and associated text corpora to be used in the system, and also to give training relations. Figure 4.14 shows a component diagram of the involved modules.

At first the user specifies a domain, which basically consists of a domain name. Every corpus is associated with a domain. Corpus definitions also include metadata about corpus text files, more precisely the XML-annotated natural language text and an additional file which contains the corresponding part-of-speech (POS) tags. The specification of training relations is a more complex task, which includes the automatic creation of corresponding regular expressions. Users specify training relations via CSV data files. Like corpora, training relations are associated with a domain. For every relation the system automatically generates a relation with inverted direction, for example for *scientist* - *study* - *greenhouse effect* it generates a relation *greenhouse effect* - *studiedBy* - *scientist.* For every relation the metadata upload scripts create entries in a *concept* table for subjects and objects in that relation. The concept table also includes the regular expressions; the application generates them automatically with the help of some heuristics which compile singular and plural forms of the terms. A domain expert can extend those regular expressions if necessary. For a large-scale implementation of the architecture this will not be feasible; in this case synonyms or WordNet senses detected in earlier phases of ontology learning should be applied.
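The two automatic steps mentioned above, pattern generation and creation of the inverted relation, might look roughly as follows. The heuristics of the original system are not specified, so everything here is an assumption: `concept_regex` uses a deliberately naive singular/plural rule, and `INVERSE_LABELS` contains only the one label pair named in the text.

```python
import re

# Only this pair is named in the text; further pairs would be needed in practice.
INVERSE_LABELS = {"study": "studiedBy"}

def concept_regex(term):
    """Naive pattern matching a term and a simple plural form (hypothetical)."""
    return re.compile(r"\b" + re.escape(term) + r"(?:s|es)?\b", re.IGNORECASE)

def with_inverse(subject, label, obj):
    """Return the relation plus its inverted counterpart, if a label is known."""
    relations = [(subject, label, obj)]
    if label in INVERSE_LABELS:
        relations.append((obj, INVERSE_LABELS[label], subject))
    return relations

pattern = concept_regex("scientist")
relations = with_inverse("scientist", "study", "greenhouse effect")
```

Irregular plurals and multi-word variants are not covered by this rule, which is where manually extended expressions or synonym lists would come in.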

*Figure 4.15:* Component diagram of sentence collection and verb extraction

After the definition of basic metadata about domains, corpora and relations, the next step is the collection of sentences with matching relations. This process has domain and corpus specifications as input, and then compiles a list of associated relations and corpus file handles. The actual computational work is done by nested loops iterating over corpora and relations. The component splits every corpus into sentences, and then matches them against the relations' regular expression patterns, resulting in matches. The matches include the match positions necessary for upcoming analyses. Figure 4.15 gives an overview of the involved components. Finally, the application saves the matching sentences (including POS tags) and the so-called *evidences,* i.e. the metadata about relation matches per sentence including match positions, to the database.
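A simplified version of the matching loop is sketched below. It assumes the corpus has already been split into sentences and that each relation is represented by one compiled pattern for the subject and one for the object; the function name `collect_evidences`, the dictionary layout and the patterns are hypothetical, and database storage and POS tags are left out.

```python
import re

def collect_evidences(sentences, relations):
    """Find sentences in which both concepts of a relation occur.

    relations maps a relation id to (subject_pattern, object_pattern).
    Each evidence records the match positions for later verb extraction.
    """
    evidences = []
    for sentence_id, sentence in enumerate(sentences):
        for relation_id, (subject_re, object_re) in relations.items():
            subject = subject_re.search(sentence)
            obj = object_re.search(sentence)
            if subject and obj:
                evidences.append({
                    "relation": relation_id,
                    "sentence": sentence_id,
                    "subject_span": subject.span(),
                    "object_span": obj.span(),
                })
    return evidences

sentences = [
    "reducing co2 protects us from the threat of climate change.",
    "the new study comes from researchers at the georgia institute of technology.",
]
relations = {1: (re.compile(r"\bco2\b"), re.compile(r"\bclimate change\b"))}
evidences = collect_evidences(sentences, relations)
```

Only the first match per concept is recorded here; handling multiple occurrences per sentence would require iterating over all matches.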

In order to compile verb vectors representing relations, the application needs to generate verb lists from sentences where the relations occur. This task needs to distinguish various modes of verb collection, e.g. to collect all verbs in the sentence or to apply restrictions, such as sliding windows of various size (five or seven words). Those modes are named *verbmodes;* in addition to the determination of word windows, the *verbmodes* specify whether or not to extract prepositions directly following verbs.

When extracting verbs from sentences, the current implementation minimizes the distance between subject and object concept labels if multiple occurrences are detected, and is also capable of discovering and utilizing multiple occurrences of entire relation patterns per sentence. The verb extraction module iterates over all sentence evidences found in the previous step and compiles a list of verbs per relation for every *verbmode* defined. Currently it generates about ten lists for combinations of various verb window sizes with the optional collection of prepositions. Those verb lists are then lemmatized and saved into the respective database tables. An additional component computes significance scores from the verb lists with the *tf-idf* measure for every combination of relation, verb, and verbmode. The *tf-idf* component includes several verb filtering steps, such as filtering verbs shorter than two characters (which result from wrong POS tags), and filtering verbs that occur with fewer than ten relations in the corpus.
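The two filtering steps named above can be sketched as follows. The function name `filter_verbs` and the data are hypothetical; the default thresholds follow the text (at least two characters, at least ten relations), while the example call lowers the relation threshold so that the toy data is not filtered out entirely.

```python
from collections import Counter

def filter_verbs(verbs_per_relation, min_length=2, min_relations=10):
    """Drop very short verbs and verbs that occur with too few relations."""
    # Count, per verb, the number of distinct relations it occurs with.
    relation_count = Counter(
        verb for verbs in verbs_per_relation.values() for verb in set(verbs)
    )
    return {
        relation: [
            verb for verb in verbs
            if len(verb) >= min_length and relation_count[verb] >= min_relations
        ]
        for relation, verbs in verbs_per_relation.items()
    }

# Hypothetical verb lists; "x" stands for a token produced by a wrong POS tag
verb_lists = {1: ["be", "x", "cause"], 2: ["cause", "be"], 3: ["halt"]}
filtered = filter_verbs(verb_lists, min_relations=2)
```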

With the conclusion of the verb extraction and *tf-idf* generation step we have collected all necessary ingredients to compile verb vectors for relations depending on a *verbmode.* 

#### **4.6.2 Compute Vector Space Similarities**

After training the system with a set of training relations for any predicate, the application computes similarity scores between testing and training relations. At first it constructs verb vectors for the unnamed (testing) relations - for this step the procedure presented in Section 4.6.1 applies to the testing relations analogously. The system handles training and testing relations in a similar way; the database tables simply include flags to distinguish them. The following pseudocode gives a conceptual impression of what the *similarity* component does:

```
Load all training and testing relations from the database for
   the specified configuration
For every relation in training + testing relations:
  For every verbmode:
    Load verbs from the database ordered by tf-idf significance
    Assemble the verb vector
For every testing relation:
  For every training relation:
    For every verbmode:
      Compute similarity score
      Save score to the database
```
It is worth mentioning that the user needs to specify a *domain* as configuration parameter; this determines the training relation set to be used. The *verbmode* setting determines the verb vectors to be loaded; if it is not configured, the system computes similarities for all *verbmodes.* The final loop in the pseudocode presented above shows that every testing relation is compared to every training relation - by default this is done for every verbmode. This easily leads to a large number of resulting similarity scores. In our experiments (see Chapter 5) we used around 300 training and testing relations and ten verbmodes (300 × 300 × 10 = 900,000 database entries). For debugging purposes and for the visualization of similarity scores, an HTML table generation module allows for graphical presentation of similarity matrices between sets of training and testing relations.
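The nested comparison can be sketched in Python as follows. The sketch assumes cosine similarity on sparse verb vectors as one common choice of vector space measure; the function names and the in-memory dictionaries (standing in for the database tables) are hypothetical.

```python
import math

def verb_vector(tfidf, top_k=20):
    """Keep the top_k verbs with the highest tf-idf score as a sparse vector."""
    top = sorted(tfidf.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return dict(top)

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {verb: weight}."""
    dot = sum(w * v[verb] for verb, w in u.items() if verb in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

def similarity_table(testing, training, verbmodes):
    """Compare every testing relation with every training relation per verbmode.

    testing / training: {relation_id: {verbmode: {verb: tf-idf}}}
    Returns {(test_id, train_id, verbmode): score}.
    """
    return {
        (t_id, r_id, mode): cosine(verb_vector(t[mode]), verb_vector(r[mode]))
        for t_id, t in testing.items()
        for r_id, r in training.items()
        for mode in verbmodes
    }
```

The number of entries in the returned table is the product of the three input sizes, which is what produces the large number of database rows mentioned above.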

### **4.6.3 Ontological Restrictions and Concept Grounding**

Having computed the similarity scores between testing and training relations with the VSM-based method, we may refine those scores by checking the conformity of suggested relation labels to ontological restrictions in order to improve the precision of the overall method. This process can be divided into a few basic elements:


#### **Concept Grounding by Querying DBpedia**

As outlined in Section 4.5, concept grounding evolved from querying DBpedia and Freebase.com with SPARQL/RDQL to ontology reasoning with external ontologies linked from DBpedia. The implementation of both methods will be presented, starting with the initial approach of querying DBpedia.

The input to concept grounding are the concepts occurring in relations. At first the application fetches the respective RDF data from DBpedia for every concept label. To facilitate this task the application converts the input into DBpedia notation and makes an HTTP call to `http://dbpedia.org/data/TERM`. If no entry exists, the program halts, since no grounding is possible. If the DBpedia page contains a DBpedia redirect, the system downloads the referred page. We store the fetched RDF data in the filesystem. After processing all new concepts, the controller iteratively loads their RDF data from the filesystem into the Redland RDF parser (Raptor, see Section 3.3.3) in order to build an RDF model. Figure 4.16 gives a graphical overview of the involved components.

Figure *4.16:* Component diagram of concept type detection by querying DBpedia with SPARQL/RDQL

Specific to concept grounding by querying DBpedia, we use a set of predefined queries per classification concept aiming to detect the type. For example if an RDF model matches the query

```
SELECT ?a ?c WHERE (?a dbpedia:birthPlace ?c)
USING dbpedia FOR <http://dbpedia.org/property/>
```
the program assumes that the concept in question is of type *cl:Person,* because typically only instances of the concept person have a *birth place.* Another example is

```
SELECT ?a WHERE (?a
   <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
   yago:InternationalOrganizations) USING yago FOR
   <http://dbpedia.org/class/yago/>
```
which includes links to *YAGO*<sup>21</sup>, a comprehensive semantic database, and maps the concept to *Organization.* The system includes about 20 of these query patterns. This initial implementation of concept type detection has a number of drawbacks, which are the reason for switching to more sophisticated reasoning as the system evolved. The patterns have to be compiled by manually scanning through DBpedia RDF files to find clues for potential patterns. Recall was low and conflicts occurred (domain concepts matching patterns of different classification types).

Many DBpedia resources contain links to Freebase.com in the form of the OWL *sameAs* property. Exploring these links and handling the linked structured data just as DBpedia data helped improve recall: the system downloads the data from Freebase.com, creates an RDF model from it, and queries the model with a predefined set of queries. The following example shows an RDQL query to map concepts to the type *Person:*

```
SELECT ?c WHERE (<Concept-Freebase-URI>
   <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?c)
AND ?c =~ /base.people/i
```
#### **Concept Grounding by Ontology Reasoning**

Grounding concepts with ontology reasoning provides crucial advantages over simple graph queries as described above, since it requires less manual input and is less ad-hoc. The method has two major prerequisites, which are database tables or triple stores that contain all triples expanded from the external ontologies, and a definition of a linkage between external concepts and classification ontology concepts.

We apply a reasoner from the Jena framework (see Section 3.3.2), more precisely the OWL Micro Reasoner<sup>22</sup>, to generate inferred statements for an input ontology. These statements are stored persistently in a PostgreSQL database. The framework uses OpenCyc, DBpedia, and Umbel ontologies as input, and saves the inferred model for each in a distinct database table as triple data. The second prerequisite is a set of mappings to the classification ontology. We define classification concepts as a collection of concepts from external ontologies (see Chapter 4).

The actual type detection starts with the extraction of all statements in the DBpedia file found for a domain concept which contain *rdf:type* or *owl:sameAs* properties that link the DBpedia entry to an external ontology. The module iterates over those concepts found in external ontologies and tries to determine their type by checking in the corresponding database table if they are a subconcept (or direct match) of a concept used in the classification ontology collections. In case of multiple types, we apply simple rules to choose a type and save the result to the database. The current preference rules select the first type found in an ordered list (Person, Organization, etc.).

<sup>21</sup>http://www.mpi-inf.mpg.de/yago-naga/yago

<sup>22</sup>http://jena.sourceforge.net/inference
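The type detection and the preference rule can be sketched in Python as follows. This is only an illustration under stated assumptions: the inferred triples are represented as an in-memory set of (subconcept, superconcept) pairs rather than a database table, the concept names in the usage note are invented, and the preference order beyond *Person* and *Organization* is hypothetical.

```python
# Hypothetical preference order; the text only names the first two entries.
TYPE_PREFERENCE = ["Person", "Organization", "Topic"]

def detect_types(external_concepts, subclass_pairs, collections):
    """Collect all classification types a domain concept may belong to.

    external_concepts: concepts linked from the DBpedia entry
    subclass_pairs:    set of (sub, super) pairs taken from the inferred model
    collections:       {type name: set of external concepts defining the type}
    """
    found = set()
    for concept in external_concepts:
        for type_name, members in collections.items():
            direct_match = concept in members
            subconcept = any((concept, m) in subclass_pairs for m in members)
            if direct_match or subconcept:
                found.add(type_name)
    return found

def choose_type(detected_types, preference=TYPE_PREFERENCE):
    """Return the first preferred type among the detected ones, or None."""
    for candidate in preference:
        if candidate in detected_types:
            return candidate
    return None
```

For example, a concept linked to an external class that is a subconcept of both a *Person* and an *Organization* collection member would be assigned *Person* under this ordering.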

#### **Ontological Restrictions**

To make use of ontological restrictions for this application, the system needs to transform them into some kind of lookup function. This lookup function has the two concepts from the classification meta ontology (subject and object) and the suggested relation label as input, and returns a weighting factor. Equation 4.30 in Section 4.5 gives a formal description of the semantic verification variants including the resulting weights. The application cannot use the restrictions defined in OWL directly, therefore we transform the domain and range, and also local restrictions, into a database table. This table obviously includes the subject-type, object-type, relation-label and the weight associated with this triple - so the lookup function simply does a query on this table to determine the correct weight. The pseudocode to translate all combinations of subject, object and relation label into a lookup table is:

```
For every relation label:
  For every subject type:
    For every object type:
      Check if the combination fulfills all restrictions
          (domain, range, property)
      # compute weight depending on number of "unknown" types
      If all restrictions fulfilled:
        If both types known: weight = 1.0
        Else if one type known: weight = 0.8
        Else: weight = 0.6
      Else:
        weight = 0.01
      Write weight to the database
```
The application then refines all similarity-score entries in the similarity database table. More precisely, the score from the VSM is multiplied by the weight derived from the ontological restrictions.
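A minimal Python sketch of the lookup table and the score refinement is given below. The weights are those from the pseudocode; everything else is an assumption made for illustration: restrictions are reduced to plain domain and range sets per relation label, an unknown type is taken never to violate a restriction, and the names `build_weight_table` and `refine` are hypothetical.

```python
from itertools import product

UNKNOWN = "unknown"

def build_weight_table(restrictions, types):
    """Translate restrictions into {(subject type, object type, label): weight}.

    restrictions: {label: (allowed subject types, allowed object types)}
    Assumption: an unknown type is compatible with every restriction.
    """
    table = {}
    for label, subj, obj in product(restrictions, types, types):
        domain, range_ = restrictions[label]
        fulfilled = (subj == UNKNOWN or subj in domain) and (
            obj == UNKNOWN or obj in range_
        )
        if not fulfilled:
            weight = 0.01
        else:
            # Number of known types: 2 -> 1.0, 1 -> 0.8, 0 -> 0.6
            known = (subj != UNKNOWN) + (obj != UNKNOWN)
            weight = {2: 1.0, 1: 0.8, 0: 0.6}[known]
        table[(subj, obj, label)] = weight
    return table

def refine(score, subj_type, obj_type, label, table):
    """Multiply a VSM similarity score with the looked-up weight."""
    return score * table[(subj_type, obj_type, label)]
```

A restriction such as "*study* has domain *Person*, *Organization* and range *Topic*" then leaves the score of a (Person, Topic) pair unchanged and reduces the score of a (Topic, Person) pair to one percent of its value.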

#### **4.6.4 Scarlet**

The framework also integrates a component to call the external service *Scarlet* (see Section 4.3.5) to retrieve a relation label between two concept labels by lookup in online ontologies. We save the relation labels found by Scarlet, if any, into the *relation* database table, and optionally use them in the course of the evaluations. We customized the example client provided as part of the Scarlet API.

The key method call is *findRelationBetweenTerms(termA, termB),* which returns the relations found between the input terms. The Scarlet RelationFinder class includes a number of configuration options (see Section 4.3.5); we configured Scarlet to detect as many relations as possible by setting inference depth to a high value and looking for all types of relations.

A Python process iterates over all the relations defined in our system for a particular domain. It calls the Scarlet RelationFinder via a system call for each relation and parses the output in order to detect the relation label. A mapping function transforms this label into a predicate identifier, which is finally stored in the database.

#### **4.6.5 Evaluation**

The evaluation component ranks and aggregates results from similarity computations based on a set of input parameters in order to generate statistical data about the performance of the relation labeling system.

We always evaluate the given measures (ARP, first guess correct, second guess correct, see Chapter 5) for two alternatives, which are *directed* and *non-directed.* The mode *directed* implies that relation type and relation direction must be correct for a suggestion to be accepted.

The component allows various input options as it aims at evaluating multiple implementation variants and to compare their impact on performance. Among these options are:


Based on the described configuration settings, this section concludes with a brief overview of the internal logic of the evaluation module. Having loaded the testing relations from the database, nested loops iterate over the given verbmodes and suggestion-modes, which are evaluated separately. An inner loop examines the set of testing relations. The first and crucial step is to create an ordered list of distinct relation label suggestions based on the training relations for each testing relation - the order depends on the actual configuration settings. Then the system assesses the rank of the correct relation label in the list of label suggestions, for both the *directed* and *non-directed*  variants. With this information we can finally calculate the statistical scores mentioned above for all testing relations.
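The rank assessment for both variants can be sketched in Python. The mapping from inverse labels to base relation types uses the predicate names of this thesis; the function names are hypothetical, and the sketch assumes that the ordered suggestion list always contains the correct label.

```python
# Inverse predicates mapped to their base relation type.
BASE_TYPE = {
    "studiedBy": "study",
    "usedBy": "use",
    "affectedBy": "effectOn",
    "actionTakenBy": "takeActionBy",
    "superClassOf": "subClassOf",
}

def undirected(label):
    """Drop the direction of a relation label."""
    return BASE_TYPE.get(label, label)

def ranks(suggestions, correct):
    """Return (directed rank, non-directed rank) of the correct label.

    suggestions: distinct relation labels, ordered best first.
    """
    directed_rank = suggestions.index(correct) + 1

    # Collapse both directions into one entry, keeping the best position.
    collapsed = []
    for label in suggestions:
        base = undirected(label)
        if base not in collapsed:
            collapsed.append(base)
    non_directed_rank = collapsed.index(undirected(correct)) + 1
    return directed_rank, non_directed_rank
```

A suggestion list that ranks *study* first and *studiedBy* third for a relation whose correct label is *studiedBy* thus counts as rank 3 in the *directed* variant but as rank 1 in the *non-directed* variant.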

# **Chapter 5 Results and Evaluation**

This chapter contains the results of an extensive set of experiments conducted to evaluate the performance of the method presented in this thesis, as well as the various subcomponents involved. Due to previous experience in the domain, e.g. with climate change portals<sup>1</sup> [158] initially developed within the IDIOM project<sup>2</sup>, the experiments were conducted in the domain of *climate change.* As the thesis focuses on the task of relation label detection, we let domain experts manually extend a set of relations identified by the webLyzard ontology extension architecture [105] with the aim to have an extensive training base, resulting in 313 distinct relations ($\mathcal{R}$). Adding the relations with the concepts in reverse order ($\mathcal{R}^{-1}$, e.g. scientist $\xrightarrow{study}$ greenhouse\_effect; greenhouse\_effect $\xrightarrow{studiedBy}$ scientist) in an automatic procedure complements the initial set of relations, providing 626 relations finally used in the evaluation process. Differentiating between the two directions of a relation is necessary to distinguish the active and passive form of a relation.

The current chapter starts with a specification of the domain corpus utilized in the evaluations, as well as the predefined predicates and the manually classified training relations in Section 5.1. Section 5.2 evaluates the accuracy of relation label suggestions solely based on the vector space model regarding a number of performance measures and various configuration settings. A discussion of the results of concept grounding according to the classification ontology follows in Section 5.3. After presenting the conclusions from experiments with Scarlet in Section 5.4, the accuracy of the hybrid relation labeling approach combining VSM with semantic inference and validation is assessed in Section 5.5.

<sup>1</sup>http://www.ecoresearch.net/climate

<sup>2</sup>http://www.idiom.at

## **5.1 Domain Relations and Domain Corpus**

Collecting a significant number of verbs that co-occur with domain relations requires a domain corpus of sufficient size. The evaluation presented in this section relies on domain corpora collected by the webLyzard suite of Web mirroring and analysis tools.<sup>3</sup> Crawling 156 news media sites selected from the Newslink.org, Kidon.com and ABYZNewsLinks.com directories provided a basic domain corpus, which was complemented by separate corpora built from Web sites of NGOs and from environmental blogs gathered via APIs such as Google Blog Search<sup>4</sup> and Technorati.<sup>5</sup> The crawler collected around 200,000 Web pages from news media sites at weekly intervals. A domain detection service based upon matching regular expressions against the crawled documents provided content specific to the climate change domain. The described sources yielded a total of 290,096 documents between October 2008 and March 2009.

As mentioned above, we used 626 relations in the evaluation process. In order to raise the robustness of the methods based on the vector space model (VSM), we eliminated relations that are not sufficiently represented in the domain corpus. Removing relations that occur in fewer than 10 distinct sentences in the corpus resulted in an evaluation set of 461 relations.

Table 5.1 lists the relation types used as well as the number of distinct sentences in all the corpora that satisfy Equation 4.23 (see Section 4.5.1). Most ontology learning systems acquire taxonomic *subClassOf* relations separately from non-taxonomic relations; this is also true for the webLyzard ontology learning framework, where the basic ontology learning components detect *subClassOf* with a number of techniques. The *subClassOf* predicate is among the set of relation types in the present evaluation for cases where the components that detect taxonomic relations failed to identify the relation type. *study* and *takeActionBy* have very strict domain and range restrictions, with a domain of *Person, Organization* and a range of *Topic* - for the observed climate change conceptualization.

The sentences extracted from the corpora form the basis for building the per-relation verb vectors. The total number of sentences in the evaluation database is 126,163. Multiple relations may match in one sentence, resulting in a number of 160,456 *evidences* (matches of a relation in a distinct sentence). The same sentence may appear several times in the corpus, especially because we use pages from Web sites mirrored at regular intervals, which sometimes include evolving versions of documents. In summary, the corpus parsing modules found 241,353 matches, i.e. relations matching in a sentence, where one sentence may occur multiple times.<sup>6</sup>

<sup>3</sup>http://www.weblyzard.com

<sup>4</sup>http://blogsearch.google.com

Table *5.1:* Relation types used in the evaluation and number of sentences found per relation in the corpora


*Table 5.2:* Example regular expression patterns for concepts *Cm, Cn* 

We used around 50 pre-defined concept-relation patterns for each of the relation types given in Table 5.1 (for exact numbers see below); Table 5.2 gives examples of such learning patterns. An example of a regular expression for the relation *company* ↔ *gasoline* as it is applied to the domain corpora using the Python programming language is:

`/(^|.*?\W)(compan(?:y|ies))(\W|\W.*?\W)(petrol|gasoline)(\W|$)/`

The regular expression captures the terms themselves as well as the surrounding text, which is needed for further analyses, such as the word distance between concept representations. The formulation of the regular expressions is currently done in a semi-automatic fashion based on concept labels: the system automatically generates regular expressions that cover plural inflections, and a domain expert checks them and extends them where necessary.
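The following sketch applies the company/gasoline pattern with Python's standard `re` module and shows how the captured groups yield both the matched terms and their positions. The pattern is a cleaned-up reading of the printed expression; the helper name `match_relation` and the returned dictionary layout are hypothetical.

```python
import re

# Groups: 1 = left context, 2 = first concept, 3 = text in between,
#         4 = second concept, 5 = right boundary
PATTERN = re.compile(
    r"(^|.*?\W)(compan(?:y|ies))(\W|\W.*?\W)(petrol|gasoline)(\W|$)"
)

def match_relation(sentence):
    """Return labels and character spans of both concepts, or None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return {
        "subject": (m.group(2), m.span(2)),
        "object": (m.group(4), m.span(4)),
        "between": m.group(3),
    }

result = match_relation("The company sells gasoline.")
print(result)
```

The text captured between the two concepts is what a verb extraction step would inspect, and the spans make it possible to compute the word distance between the concept labels.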

Table 5.3 presents the number of relations predefined by domain experts per relation type. As mentioned above, we only considered relations that are reflected in the domain corpora. More precisely, the relation has to occur in at least 10 distinct sentences - the column *Filtered Number* gives the number of relations that exceed this threshold.

<sup>6</sup>Whenever speaking about matches of relations in a sentence, this refers to matches of the lexical representation of the two concepts involved in the relation with the help of regular expressions.


*Table 5.3:* The original and utilized number of relations per relation type

The correct relation labels were originally provided by two domain experts, and then validated independently by two other experts (inter-expert agreement of 90.2%). Most cases of non-conformance referred to ambiguities between *takeActionBy/actionTakenBy* and *effectOn/affectedBy;* a smaller percentage of disagreement also applied to *subClassOf/superClassOf* versus *effectOn/affectedBy.* For the predicates *use/usedBy* and *study/studiedBy* dissent among domain experts was rare.

## **5.2 Evaluation of the Vector Space Model**

This section presents the evaluation of the VSM-based approach for relation labeling, and compares different variants of the method. The basic prerequisite for the application of the VSM is the extraction of verbs from the domain corpora, which is done according to the *verbs* function as described in Equation 4.23 on sentences matching a particular relation. The number of extracted verbs depends on the mode of extraction (whole sentence or sliding windows, see Section 4.5.1). In the *whole sentence* mode, for example, the average number of verbs extracted was 1,398.88 per relation, with a maximal value of 41,039. The frequencies for a *sliding window of seven words* are naturally lower, with 313.58 verbs on average and a maximal frequency of 8,734 verbs. Those verbs were used to generate the *tf-idf* significance scores, which in turn are the basis for the verb vectors. In the experiments we evaluated the performance of two thresholds on verb selection, namely to use the 20 verbs with highest *tf-idf* significance per relation, as well as the 150 verbs with highest significance. The lower value of 20 has the advantage that only the verbs most significant for a relation are included in the vector, and also that computational complexity is lower. Selecting 150 verbs on the other hand integrates a wider spectrum of associated verbs and leads to a broader overlap with verbs from other relations.

The system randomly splits the full set of 461 relations into training and testing relations of equal number per predicate for every single evaluation run in order to avoid a selection bias. The whole evaluation process consists of seven such runs; the upcoming data tables present the average over the runs, which reduces the effect of random bias.
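One way to realize such a split is sketched below in Python. The function name, the `seed` parameter and the tuple representation of relations are hypothetical; when a predicate has an odd number of relations, the sketch assigns the extra relation to the testing set, which is an arbitrary choice.

```python
import random
from collections import defaultdict

def split_per_predicate(relations, seed=None):
    """Randomly split relations into equally sized halves per predicate.

    relations: iterable of (relation_id, predicate) tuples
    Returns (training ids, testing ids).
    """
    rng = random.Random(seed)
    by_predicate = defaultdict(list)
    for rel_id, predicate in relations:
        by_predicate[predicate].append(rel_id)

    training, testing = [], []
    for predicate, rel_ids in sorted(by_predicate.items()):
        rng.shuffle(rel_ids)
        half = len(rel_ids) // 2
        training.extend(rel_ids[:half])
        testing.extend(rel_ids[half:])
    return training, testing

# Seven runs with different seeds; the reported scores would be the
# average of the per-run results.
runs = [split_per_predicate([(i, "use") for i in range(10)], seed=s)
        for s in range(7)]
```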

#### **5.2.1 Evaluation Baselines**

The results from the VSM-based method are compared to two baseline references: a random baseline and a relation label suggestion method from literature adapted to our scenario. The random baseline simply suggests a relation label for a testing relation randomly. In the case of ten possible relation labels, for example, the chance to select the correct one on first guess is 1/10 (10%).

Kavalec and Svatek [95] propose the heuristic *above expectation* measure for the task of relation labeling (see also Section 4.3.1). We used an adapted version of this approach as baseline for the evaluation of the methods elaborated in the present thesis. The *Above Expectation (AE)* measure compares the observed frequency of co-occurrence of two concepts in a specific relation to the expected frequency under the assumption of independence. Above expectation is calculated as:

$$AE(c_1 \wedge c_2 / v) = \frac{P(c_1 \wedge c_2 / v)}{P(c_1 / v) \cdot P(c_2 / v)}$$

The conditional frequency of the concept pair ($c_1$, $c_2$) given a verb *v* is compared to the expected conditional frequency of $c_1$ given verb *v* multiplied by $P(c_2/v)$ for concept $c_2$. Based on this measure, Kavalec and Svatek [95] compile lists of verbs ordered by *above expectation* as label candidates, which are evaluated for equality or synonymity against labels suggested by domain experts. In order to make the approach comparable with our methods, and to integrate it into the evaluation framework, it was adapted as follows: The system computes AE scores for all verbs occurring with testing relations and also training relations, and then selects the best four<sup>7</sup> verbs for any testing relation. The similarity of the testing and training relations is computed based on the position of those four verbs in the AE list of the training relation.
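As a worked illustration of the AE formula, the following Python sketch estimates the conditional probabilities by relative frequencies. This estimation from raw co-occurrence counts is an assumption made here for illustration, and the function name and the example counts are hypothetical.

```python
def above_expectation(n_pair_v, n_c1_v, n_c2_v, n_v):
    """Above expectation of a concept pair for a verb v.

    n_pair_v: occurrences of v together with both c1 and c2
    n_c1_v:   occurrences of v together with c1
    n_c2_v:   occurrences of v together with c2
    n_v:      all occurrences of v
    """
    p_pair = n_pair_v / n_v   # estimate of P(c1 ^ c2 / v)
    p_c1 = n_c1_v / n_v       # estimate of P(c1 / v)
    p_c2 = n_c2_v / n_v       # estimate of P(c2 / v)
    return p_pair / (p_c1 * p_c2)

# A verb seen 100 times, 20 times with c1, 10 times with c2 and
# 5 times with both: 0.05 / (0.2 * 0.1), i.e. about 2.5.
score = above_expectation(n_pair_v=5, n_c1_v=20, n_c2_v=10, n_v=100)
```

A value above 1 means that the pair co-occurs with the verb more often than independence of the two concepts would suggest; a value of 1 corresponds to exactly the expected frequency.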

<sup>7</sup>We experimented with various sizes; using the four verbs with highest AE performed best in our experiments.


*Table 5.4:* Configuration settings for the VSM-based relation suggestion


#### **5.2.2 Configuration Parameters**

The evaluation results suggest that there is not a single setting or variant that works best under all circumstances and for all evaluation measures. Table 5.4 summarizes the most important settings used in the evaluation process.

The sequence of configuration settings in Table 5.4 reflects the position in the system architecture. Verb extraction precedes the building of vectors and the corresponding computation of similarity values, and finally those scores are filtered and aggregated. Combining the configuration parameters described yields a large number of possible combinations. We present evaluations for a number of representative combinations and discuss them in the upcoming sections.

To make the evaluation tables in this section easier to read and compare, they all follow a similar structure. The table columns represent the various modes of verb extraction. They are labeled accordingly as *whole sentence*, *sliding window 7* and *sliding window 5.* The sliding window sizes were chosen to distinguish the effect of a very tight window (size 5) and an average one (size 7); we also experimented with window size 9, which gave results very similar to window size 7. Additionally, denoted by the *direction* flag, we distinguish guessing just the correct basic relation type - referred to as *non-directed* or *"direction: no"* - and choosing the correct type and direction (*directed* or *"direction: yes"*). In configuration *"non-directed"* the system does not consider the order of concepts, i.e. it does not distinguish *study* and *studiedBy.* The table rows reflect the relation labeling methods, such as *VSM* for vector space model, or *"Baseline: KS"* for the method adapted from Kavalec and Svatek [95], and also indicate additional settings, for example that the verb extraction includes the use of prepositions following a verb (unless otherwise noted only the verbs were used). The table rows are grouped by the suggestion aggregation modes which were used in the experiments:

*(i)* involves no aggregation and selects the relation labels from the single training relations with the highest similarity score to the testing relation,

*(ii)* computes the average of similarities over all training relations of a particular predicate (e.g. *takeActionBy),* and

*(iii)* suggests a label based on the average of the best (most similar) 30% of training relations for a particular relation type (for more information on aggregation modes see Section 4.5.4).
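The three aggregation modes can be sketched in Python as follows. The function name, the mode identifiers and the rounding rule for the "best 30%" (rounded up, at least one relation) are assumptions made for this illustration and are not taken from the thesis.

```python
import math
from collections import defaultdict

def suggest_labels(similarities, mode="best30", fraction=0.3):
    """Order relation labels for one testing relation, best first.

    similarities: list of (predicate, score) pairs, one per training relation
    mode: "single" for (i), "average" for (ii), "best30" for (iii)
    """
    by_predicate = defaultdict(list)
    for predicate, score in similarities:
        by_predicate[predicate].append(score)

    aggregated = {}
    for predicate, scores in by_predicate.items():
        scores = sorted(scores, reverse=True)
        if mode == "single":
            aggregated[predicate] = scores[0]
        elif mode == "average":
            aggregated[predicate] = sum(scores) / len(scores)
        elif mode == "best30":
            k = max(1, math.ceil(fraction * len(scores)))
            aggregated[predicate] = sum(scores[:k]) / k
        else:
            raise ValueError(f"unknown mode: {mode}")
    return sorted(aggregated, key=aggregated.get, reverse=True)
```

The modes can disagree: a predicate with one very similar training relation but otherwise low scores wins under mode (i), whereas a predicate with consistently moderate scores can win under modes (ii) and (iii).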

The following sections present the evaluation of the VSM on the basis of three measures: the Average Ranking Precision, the percentage of correct labels on first guess and the percentage of correct labels on first or second guess.

#### **5.2.3 Average Ranking Precision**

The Average Ranking Precision **(ARP)** is the average number of tries necessary to pick the correct relation label from an ordered list of suggestions, i.e. how many picks are needed on average to get to the correct label. This measure is very relevant when supporting domain experts with relation labeling, as the ARP score reflects the manual effort needed to choose the correct label from the sorted list of candidates. The tables below present the ARP results for the VSM calculated for the various configuration settings and the random baselines. The random baseline in case of configuration *directed* is 5.5 (as there are 10 possible choices), and for configuration *non-directed* the baseline is 3.0. The best ARP score is 1.0 - when all relations are labeled correctly on first guess; the worst score is 10.0 (*directed*) and 5.0 (*non-directed*), respectively. The baseline from the adapted method by Kavalec and Svatek is not applied for ARP computations, as the original method simply suggests a label and does not involve the calculation of a ranked list of relation labels.
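The ARP computation and the random baselines can be reproduced with a few lines of Python. The function names are hypothetical; the baseline follows from the fact that the correct label is equally likely to appear at each of the *n* positions of a randomly ordered list, so its expected rank is (1 + 2 + ... + n) / n = (n + 1) / 2.

```python
def rank_of_correct(suggestions, correct):
    """1-based position of the correct label in the ordered suggestions."""
    return suggestions.index(correct) + 1

def average_ranking_precision(ranked_lists, correct_labels):
    """Mean rank of the correct label over all testing relations."""
    ranks = [rank_of_correct(s, c) for s, c in zip(ranked_lists, correct_labels)]
    return sum(ranks) / len(ranks)

def random_baseline(n_labels):
    """Expected rank of the correct label in a randomly ordered list."""
    return (n_labels + 1) / 2

directed_baseline = random_baseline(10)     # ten directed labels
non_directed_baseline = random_baseline(5)  # five relation types
```

With ten directed labels this gives 5.5, and with five non-directed relation types 3.0, which matches the baselines stated above.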

#### **TF-IDF Configuration**

All results in the upcoming tables in this section rely on the use of the VSM only, with no integration of concept type information. The various tables reflect VSM variants with different configuration settings. Table 5.5 summarizes the results comparing the application of the 20 most significant verbs in the VSM *(tf-idf 20)* to the 150 most significant verbs *(tf-idf 150).* 


*Table 5.5:* ARP results comparing the VSM based on *tf-idf* with the 20 versus 150 most significant verbs

Table 5.5 shows that there are only minor differences caused by the differing vector sizes of 20 and 150. *tf-idf 150* performs slightly better when using no aggregation of training relations, i.e. for aggregation mode (i); for (ii) *tf-idf 20* has an advantage. The overall tendency observable in Table 5.5 is that the ARP result in the case of directed relations is between 2.7 and 3.0 guesses for getting the correct answer; when ignoring direction the scores are around 2.0 - compared to random baselines of 5.5 (directed) and 3.0 (non-directed). A sliding window of 7 yields better ARP scores than the small window of 5, probably because the lower number of verbs extracted is harmful especially when relations occur rarely in domain text. The extraction mode *sentence* outperforms *sliding window 7* for *non-directed* label suggestions in this evaluation; for directed labels the sliding windows with their narrow context seem more appropriate.

#### **Optional Prepositions**

Table 5.6 compares the VSM results for a *tf-idf 150* configuration between using plain verbs only (*VSM 150*) and using verbs including preposition suffixes (*VSM 150 prepo*). The latter category includes phrases such as "look at" instead of single words such as "look".


*Table 5.6:* ARP results comparing the use of plain verbs (*VSM tf-idf 150*) vs. verbs with prepositions (*VSM tf-idf 150 prepo*)

The inclusion of *prepositions* improves ARP performance across verb extraction and aggregation strategies. But when using only 20 *tf-idf*-ranked verbs in the vectors (separate table not included here for brevity), gains from prepositions are minimal since additional verb variants lead to lower overlap between such small vectors.

#### **WordNet Validation of Verbs**

This section evaluates and compares the ARP scores of unfiltered verbs versus verbs filtered by WordNet confirmation. Rows marked with *VSM 150* use all verbs identified by the annotations of the POS tagger in subsequent steps such as verb vector building - which is also the default configuration. As POS taggers typically generate a certain percentage of wrong tags, *"VSM confirmed"* additionally checks the verbs against WordNet and filters out entries which could not be confirmed as potential verbs.

*Table 5.7:* ARP results comparing the application of verbs confirmed with WordNet (*VSM confirmed*) vs. non-confirmed verbs (*VSM 150*)

Table 5.7 shows that using only confirmed verbs yields no significant improvements; the ARP scores for both settings are very similar. The same observation holds when using *tf-idf 20* vectors instead of *tf-idf 150.*

#### **Training Base Size**

As described above, the VSM-based method generates its label suggestion based on the similarity of a testing relation as compared to all training relations. This section presents experiments with reduced numbers of training vectors, and evaluates the influence of the size of the training base on the ARP results. In the default configuration the system equally splits all available relations into testing and training sets in a random process, resulting in about 45 testing and training relations per predicate - this corresponds to the rows denoted as *VSM 20 50%.* When only 25% of all relations are used for training purposes, about 22 training relations remain per predicate (*VSM 20 25%*). And finally, *VSM 20 10%* provides around 9 training relations per predicate in order to observe the performance of such a small training base. As indicated by the row labels, the evaluation uses the *tf-idf 20* setting (the 20 most significant verbs per relation).

As expected, Table 5.8 shows that a smaller number of training relations heavily decreases system performance. Especially when suggesting label and direction (*directed*), system performance deteriorates across verb extraction and aggregation modes, but it seems that the negative effect is stronger in the case of the *sentence* verb extraction mode than for sliding window modes.

*Table 5.8:* **ARP** results comparing the influence of various training base sizes, using a *tf-idf 20* configuration

#### **Corpus Evidence Numbers**

Table 5.9 evaluates the effect of corpus evidence numbers on VSM performance. Corpus evidence numbers refer to the number of sentences in the corpus where the relation occurs, or more precisely, in how many sentences the lexical representations of the concepts that constitute the relation occur. Relations strongly reflected in the corpus (and therefore represented by a large number of verbs) should have a superior performance in the presented VSM-based approach. The evaluation classifies relations into three groups: relations with 10 to 100 evidences (*VSM 20 <100*), relations with 100 to 250 evidences (*VSM 20 100+*), and relations matching in more than 250 sentences in the domain corpora (*VSM 20 250+*). The evaluation used a *tf-idf 20* configuration for verb selection.

The assumption that a higher number of evidences leads to better results is confirmed by the experiments presented in Table 5.9. There are strong increases in ARP performance from *VSM 20 <100* to *VSM 20 100+*, as well as from *VSM 20 100+* to *VSM 20 250+*. The scores, even for *VSM 20 250+* relations, are generally worse than one would expect, because the system also applied the evidence number filter to training relations, leading to reduced sets of training relations. Reduced training sets decrease performance (see Table 5.8) and offset the benefits of high evidence counts for the measures presented in this section. With a sufficient training base of frequently occurring relations, better results are to be expected.

*Table 5.9:* The effect of the number of sentences found per relation on the **ARP** results

#### **Predicates**

This section presents results for individual predicates evaluated separately. Table 5.10 reports on aggregation mode (iii), which suggests relation labels based on the average of the best 30% of the training relations' similarity scores. The *tf-idf 150* setting applies as verb selection configuration for the experiments.
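Aggregation mode (iii) can be sketched as follows. The sketch assumes sparse verb vectors stored as dictionaries of tf-idf weights and cosine similarity as the vector space similarity; both choices, the function names and the toy vectors are assumptions made for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse verb vectors (dict: verb -> weight)."""
    dot = sum(w * v.get(verb, 0.0) for verb, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def suggest_labels(test_vector, training, top_share=0.3):
    """Rank predicates by the mean of their best `top_share` similarity scores."""
    scores = defaultdict(list)
    for predicate, vector in training:
        scores[predicate].append(cosine(test_vector, vector))
    ranking = []
    for predicate, sims in scores.items():
        sims.sort(reverse=True)
        k = max(1, math.ceil(len(sims) * top_share))  # best 30%, at least one
        ranking.append((sum(sims[:k]) / k, predicate))
    return [predicate for _, predicate in sorted(ranking, reverse=True)]

# Hypothetical training relations: (predicate label, verb vector).
training = [
    ("use", {"use": 1.0, "burn": 0.5}),
    ("use", {"use": 0.8}),
    ("study", {"study": 1.0, "analyze": 0.7}),
    ("study", {"examine": 1.0}),
]
print(suggest_labels({"use": 0.9, "burn": 0.2}, training))
```

Averaging only the best-matching share of training relations keeps a predicate from being penalized by training relations whose verb vectors have nothing in common with the testing relation.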

There are substantial differences in performance between the individual relation types. The best results were achieved for the predicate pairs *use-usedBy*, *study-studiedBy* and *effectOn-affectedBy*. The approach was less satisfactory in the case of *subClassOf-superClassOf* and *takeActionBy-actionTakenBy*. Section 5.2.4 about evaluations for the *first guess correct* configuration includes a more detailed discussion of individual predicate performance.

*Table 5.10:* **ARP** scores for individual predicates for Method VSM 150

#### **Summary and Interpretation**

This section about the evaluation of the ARP measure for the VSM-based approach assesses the general performance of the method. It exemplifies the influence of various configuration settings on labeling accuracy, and examines the impact of reducing the training base or selecting only relations with a certain amount of evidence in the corpus.

A comparison of *tf-idf 20* and *tf-idf 150* reveals that only minor differences in performance exist for the two settings. Filtering verbs as identified by the POS tagger with WordNet also does not yield significant benefits. However, the inclusion of prepositions directly following verbs improves the VSM results, especially for the *tf-idf 150* configuration. Reducing the training base, i.e. the number of training relations, leads to a decrease in performance; on the other hand the accuracy of the method increases when relations are represented by a larger number of verbs - stemming from relations frequently occurring in the domain corpus.

*Table 5.11:* Summary of the VSM results for configuration *tf-idf 150*, including prepositions

Table 5.11 gives an example of a configuration leading to comparatively good ARP scores; the computations make use of the *tf-idf 150* verb selection setting and of verbs including prepositions. While no single configuration yields the best results for all observed measures, differences are often marginal. The VSM method attains ARP scores below 1.7 for non-directed label suggestions for the *sentence* configuration, and around 2.7 for labels including direction.

The results clearly show the benefits of the described method, but leave room for improvement. Section 5.5 presents enhancements relying on the application of ontological restrictions in the learning process and the integration of information gained from reasoning on structured information collected from online sources.

### **5.2.4 First Guess Correct**

The First Guess Correct (FGC) measure is defined as the percentage of correct labels yielded by the first suggestion, a measure highly relevant not only when attempting to label relations automatically, but also when involving a domain expert. No manual selection of an alternative relation label is necessary in situations where the first suggested label is correct. The data tables in the upcoming evaluations follow the same structure as those in the preceding section, with the *direction yes/no* flag and verb extraction modes as column headers, and the individual evaluations represented by rows and grouped by the aggregation modes. The random baselines for FGC are 10% for directed relation label suggestions, and 20% in case of non-directed labeling. The tables for FGC (and also second guess correct, see below) also include the baseline scores from the adopted Kavalec and Svatek approach, marked as *"Baseline: KS"*.
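As a minimal sketch of the measure, assume each testing relation comes with a ranked list of suggested labels and one gold label. The function names and the toy data are illustrative only; the random baseline assumes a uniform choice among the candidate labels.

```python
def first_guess_correct(suggestions, gold):
    """Percentage of relations whose top-ranked label equals the gold label."""
    hits = sum(1 for ranked, label in zip(suggestions, gold)
               if ranked and ranked[0] == label)
    return 100.0 * hits / len(gold)

def random_baseline(n_candidates):
    """Expected FGC when one of `n_candidates` labels is picked uniformly at random."""
    return 100.0 / n_candidates

# Hypothetical ranked suggestions for three testing relations.
suggestions = [["use", "study"], ["study", "use"], ["effectOn", "use"]]
gold = ["use", "use", "effectOn"]
print(round(first_guess_correct(suggestions, gold), 1))  # 66.7
print(random_baseline(10), random_baseline(5))           # 10.0 20.0
```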

#### **TF-IDF Configuration**

Table 5.12 contains the FGC values comparing the *tf-idf 20* and *tf-idf 150* configurations. Furthermore, those results are contrasted with baselines from the adopted Kavalec and Svatek approach and a random baseline.


*Table 5.12:* Percentage of correct first guesses with the VSM-based method for configuration *tf-idf 20* and *tf-idf 150,* including baseline scores

For the *directed* evaluation setting, where the application has to choose a relation label from 10 candidates, the VSM obtains FGC scores up to around 42%, for *non-directed* (5 candidates) almost 69% when using verb extraction mode *sentence*. Although *tf-idf 150* performs slightly better for verb aggregation mode (i) and *tf-idf 20* has a minor advantage in mode (ii), the differences are quite small. Similar to the ARP results, extracting verbs from the whole sentence matching a relation yields better results if the direction of relations is not taken into account, whereas sliding windows offer better performance for the setting *directed*. The accuracy of the adopted *above expectation* measure was quite low. We attribute this not to the heuristic itself, but to the difficulty of transforming Kavalec and Svatek's method to our automated relation label suggestion and evaluation procedures.

#### **Optional Prepositions**

Table 5.13 displays the evaluation results regarding a variation of the *verbs* function (see Section 4.5.1). *VSM tf-idf 20 prepositions* includes verbs and optional prepositions following the verb when computing tf-idf significances, whereas *VSM tf-idf 20* is restricted to plain verbs.
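The two variants can be illustrated with a small sketch. It assumes that sentences arrive as lists of `(lemma, POS tag)` pairs with Penn Treebank tags (verbs start with `VB`, prepositions are tagged `IN`); the function name and the example sentence are hypothetical, and the actual *verbs* function is the one defined in Section 4.5.1.

```python
def verbs_with_prepositions(tagged_tokens, include_prepositions=True):
    """Collect verbs, optionally joined with a directly following preposition.

    `tagged_tokens` is a list of (lemma, POS tag) pairs using Penn Treebank
    tags: verb tags start with "VB", prepositions are tagged "IN".
    """
    features = []
    for i, (lemma, tag) in enumerate(tagged_tokens):
        if not tag.startswith("VB"):
            continue
        feature = lemma
        if include_prepositions and i + 1 < len(tagged_tokens):
            next_lemma, next_tag = tagged_tokens[i + 1]
            if next_tag == "IN":
                feature = f"{lemma} {next_lemma}"
        features.append(feature)
    return features

# Hypothetical lemmatized and tagged sentence.
sentence = [("emission", "NNS"), ("result", "VBP"), ("from", "IN"),
            ("traffic", "NN"), ("and", "CC"), ("harm", "VBP"), ("glacier", "NNS")]
print(verbs_with_prepositions(sentence))         # ['result from', 'harm']
print(verbs_with_prepositions(sentence, False))  # ['result', 'harm']
```

Treating `result from` as a feature distinct from `result` is what gives the preposition variant additional evidence about relation direction.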


*Table 5.13:* FGC performance of *tf-idf 20* with and without *prepositions* 

The FGC results reflect the observations from the ARP evaluation section, namely that the inclusion of prepositions increases system performance, in particular for the *directed* variants, where prepositions raise the percentage of correct suggestions by up to 4%. The benefit of including prepositions is higher for the sliding window verb extraction modes than for the *sentence* mode.

#### **Predicates**

Table 5.14 presents the FGC results for individual predicates, separated into *non-directed* suggestions and labeling suggestions including relation direction (*directed*). As with the ARP evaluations of predicates, we restricted the data given in the table to verb aggregation mode (iii) for brevity.


*Table 5.14:* FGC performance broken down to individual predicates for a *tf-idf 150* configuration and verb aggregation mode (iii)

Major differences in the performance of individual predicates become evident. The predicates *use* and *study* perform particularly well, whereas the method does not seem very successful at discovering the relations *subClassOf* and *superClassOf*. A very interesting aspect is that inverse relations such as *usedBy*, *studiedBy* or *affectedBy* have lower FGC scores than their active voice equivalents. This may be attributed to the use of lemmatization techniques in the process of extracting verbs from sentences - future research will investigate this issue, and attempt to improve performance for inverted relations. Evaluations of domain expert consensus (see Section 5.1) on relation labels show that the labels *effectOn* and *takeActionBy* are often ambiguous, i.e. both labels are appropriate for a number of training relations - this might have a negative effect on the performance of corresponding relation pairs (*effectOn, affectedBy* versus *takeActionBy, actionTakenBy*); related issues such as permitting multiple correct relation labels will be tackled by future research. For the predicate pair *subClassOf/superClassOf* there was a degree of non-conformance among domain experts, which helps explain the weak performance for these predicates.

#### **Summary and Interpretation**

This section assessed the evaluation data for the first guess correct measure, i.e. the percentage of testing relations where the first guess yielded by the VSM-based approach is the correct one. A comparison of results from the *tf-idf 20* versus *tf-idf 150* verb selection thresholds revealed only minor differences in performance depending on the respective verb selection mode. Evaluation tables show that the inclusion of *prepositions* into verb vectors has a positive effect on FGC scores. An interesting insight is the varying performance for different predicates, which will need further attention in future research. Some predicates such as *use* and *study* provide a remarkably high accuracy, whereas for others, especially *subClassOf* and *superClassOf*, the VSM-based method is not very successful.

For the FGC analyses we omitted some of the experiments given in the **ARP** evaluations, namely the evaluation of the effect of using confirmed verbs only, as well as analyses regarding training base size and number of evidences per vector - the results of those evaluations were unambiguous and evident for ARP, repeating the evaluations for FGC gave no additional insights.

Table 5.15 summarizes the FGC performance of the VSM-based method on the basis of a *tf-idf 20* configuration including optional prepositions appended to verbs.


*Table 5.15:* FGC results for configuration *tf-idf 20* including prepositions

It is evident from the evaluation data that the VSM-based method yields significant improvements over the baseline scores regarding the FGC measures. We confirmed this observation with a number of Chi-squared tests. Significance levels exceed 99.99% when, for example, comparing first guess correct VSM scores against the two baselines for the *directed* and *"aggregation mode (iii)/sl. window 7"* configuration, but also for *non-directed* configurations such as *"aggregation mode (iii)/sl. window 7"*.
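A significance check of this kind can be sketched with a Pearson chi-squared statistic on a 2x2 table of correct versus incorrect first guesses. The text does not state which variant of the test was applied (for example with or without continuity correction), so the uncorrected form below is an assumption, and the counts are hypothetical placeholders that are not taken from the evaluation tables.

```python
def chi_squared_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-squared statistic (no continuity correction) for two hit rates."""
    observed = [[hits_a, total_a - hits_a], [hits_b, total_b - hits_b]]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [hits_a + hits_b, n - hits_a - hits_b]
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[r][c] - expected) ** 2 / expected
    return stat

# Critical values of the chi-squared distribution with 1 degree of freedom.
CRITICAL_0_01 = 6.635     # significance level 0.01
CRITICAL_0_0001 = 15.137  # significance level 0.0001, i.e. 99.99%

# Hypothetical counts, NOT from the thesis: a method labels 190 of 460
# relations correctly on first guess, a baseline 46 of 460.
stat = chi_squared_2x2(190, 460, 46, 460)
print(stat > CRITICAL_0_0001)
```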

#### **5.2.5 Second Guess Correct**

The Second Guess Correct (SGC) measure is very similar to FGC, but it is a little more relaxed in the sense that it reflects the percentage of situations where the first or the second guess in relation labeling is correct. This measure helps to assess how often the domain expert can select a relation label from a list of suggestions with minimal effort, by choosing the label from the top two suggestions on the list. The random baseline for the setting *directed* is 20%; if direction is neglected, the random baseline is 40%. Table 5.16 compares the *tf-idf 20* and *tf-idf 150* results for the SGC measure.
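The random baselines follow from a simple counting argument, sketched here under the assumption that a random baseline ranks the $n$ candidate labels in uniformly random order, so that the correct label is equally likely to land at any position:

$$
P(\text{correct label among the top } k) = \frac{k}{n}, \qquad
\text{SGC}_{\text{random}} = \frac{2}{10} = 20\% \ (\text{directed}), \qquad
\text{SGC}_{\text{random}} = \frac{2}{5} = 40\% \ (\text{non-directed}).
$$

With $k = 1$ the same expression gives the FGC baselines of 10% and 20%.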


*Table 5.16:* Evaluation results based on the SGC measure

As with ARP and FGC, *tf-idf 150* performs better using verb aggregation mode (i), and slightly worse for (ii). The same holds for a comparison of verb extraction modes: *sentence* is superior when no relation direction needs to be detected, whereas in configuration *directed* sliding windows provide better results in most situations. The VSM method yields SGC scores of over 80% for non-directed relations, and up to 64% for directed relations. The use of *prepositions*, similar to evaluations for ARP and FGC, provides an additional performance increase, see Table 5.17.

*Table 5.17:* SGC for *tf-idf 150* verb selection including *prepositions* 

After presenting the evaluation results for relation label suggestions based solely on the VSM, the remainder of the evaluation section will focus on results from the acquisition and integration of semantic information about concepts and on the application of ontological restrictions.

## **5.3 Concept Grounding**

The linking of concepts to types according to the classification meta ontology, also referred to as concept grounding, is a prerequisite for improving the VSM approach with semantic inference and validation. The methods for concept grounding as described in Section 4.5.2 (SPARQL queries against DBpedia, ontological reasoning) were applied to the 168 concepts included in the relations used in the experiments. Table 5.18 lists the grounding results and distinguishes two main categories: the category *grounded* refers to concepts where the grounding process succeeded, although the procedure yielded some incorrect concept types alongside the correct ones. For some concepts the current grounding methods were not sufficient to determine a concept type; those are categorized as *not grounded*.


*Table 5.18:* Success of concept grounding for all 168 concepts

117 of the full 168 concepts were grounded to a concept type from the classification meta ontology. But the grounding components did not correctly identify all concept types - as compared to manual assignment by a domain expert. Some of the links from DBpedia to OpenCyc are dubious, for example the concept *bus* was grounded wrongly, because DBpedia has an owl:sameAs link to OpenCyc's *bus line* (http://sw.opencyc.org/2008/06/10/concept/Mx4...). This *bus line* concept is a subclass of *transportation organization* and the reasoning process maps it to *Organization* although the correct mapping would have been *ObjectTopic*. Another major source of problems are DBpedia redirects such as from *activist* to the page on *activism*. Following those redirects leads to wrong grounding, in this case *AbstractTopic* instead of the correct concept type *Person*. In order to raise recall of concept grounding the use of redirects was retained despite occasional errors. Those cases of wrong grounding influence the relation labeling method's results negatively, since no concept type information at all is better than an incorrect classification. But we expect such problems to be lessened with future releases of external services and also by integrating more evidence sources in the grounding process, such as the YAGO ontology and SKOS vocabulary (see below). The rate of wrongly mapped concepts in all grounded concepts is 6% (7/117) - overall the precision of the automatic type classification is therefore quite satisfactory.
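The lookup of OpenCyc links can be illustrated with a query builder. This is a sketch under assumptions: the resource URI `http://dbpedia.org/resource/Bus` and the function name are hypothetical, the filter uses the SPARQL 1.1 function `STRSTARTS`, and the queries actually used for grounding are those described in Section 4.5.2. The code only builds the query string and does not contact any endpoint.

```python
def opencyc_links_query(dbpedia_resource):
    """Build a SPARQL query for the OpenCyc concepts a DBpedia resource links to."""
    return f"""
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?cyc WHERE {{
  <{dbpedia_resource}> owl:sameAs ?cyc .
  FILTER(STRSTARTS(STR(?cyc), "http://sw.opencyc.org/"))
}}
""".strip()

# Hypothetical resource URI for the concept label "bus".
query = opencyc_links_query("http://dbpedia.org/resource/Bus")
print(query)
```

Each OpenCyc concept returned by such a query would then be passed to the reasoning step, which is where a misleading link such as the *bus line* example turns into a wrong concept type.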

51 of 168 concepts could not be grounded into the classification meta ontology, either because no entry existed in DBpedia, or the entry did not provide the needed information for grounding. The *no DBpedia entry found* field in Table 5.18 refers to concept labels for which no DBpedia page exists, which was the case for ten concepts. Examples of such terms are *oil demand, combustion process, environmental problem* or *low-emission car*. When relying on DBpedia for concept grounding, this problem calls for additional methods such as the acquisition of synonyms or term resolution techniques (see e.g. Wong et al. [196]). For the remaining 41 concepts falling in the category *no path to a matching concept* the application located a DBpedia page, but the page did not provide sufficient information for our current concept grounding approach. Those pages did not include links to OpenCyc or to the DBpedia ontology, or those links did not contain appropriate information for type detection. Examples of pages which give only little structured information are http://dbpedia.org/page/Photovoltaic_effect or http://dbpedia.org/page/Emission. Many of the pages which yield little structured information are annotated with the SKOS vocabulary<sup>8</sup> in the form of skos:subject properties, some also have links to the YAGO ontology. The exploitation of this data will be the focus of future research. With the evolution of DBpedia we also expect more structured information to be available per entry, which will increase recall and precision of the presented grounding methods. In some cases the online sources returned the Wikipedia/DBpedia disambiguation page for a term, for example for *Administration* and for *Creation*. Disambiguation methods to tackle this problem are also part of future research.

<sup>8</sup> http://www.w3.org/TR/skos-primer

Table 5.19 presents the results of grounding according to the concept types from the classification meta ontology. The column *system results* gives the results of the automatic grounding processes, whereas the *manually assigned* column contains the outcome of manual classification by a domain expert. The QName *cl:* refers to the classification ontology throughout this section.


*Table 5.19:* Concept grounding per classification type, results from the grounding component versus manual assignment

The data in Table 5.19 reveals that for the present domain most concepts (around 65%) are in the class *ObjectTopic.* This is no surprise, as terms related to objects such as *CO2, biofuel, truck, glacier,* etc. have an important role in the domain. However, having an uneven distribution of concepts negatively affects the discriminative power of the classification schema and overall system performance. In the current evaluation some of the concepts classified as *Person* and *Organization* took part in multiple relations, which in a way leads towards a re-balancing of the concept type distribution. It is also evident from comparing the results of automatic and manual grounding that concepts from the class *AbstractTopic* are difficult to detect automatically, many were labeled as unknown. Those concepts typically include less structured information on Wikipedia (in the form of infoboxes), tend to be ambiguous, and are harder to grasp with simple queries against the DBpedia graph.

Another point of view is on the methods (DBpedia queries, reasoning) applied in the process of concept grounding. 66 of the 117 concepts were grounded by the use of ontological reasoning, the remaining 51 with DBpedia queries. Among the 66 concepts where reasoning yielded a result, 58 times links to OpenCyc provided the information needed, 5 times both links to OpenCyc and the DBpedia ontology contained the relevant clues, and only 3 times reasoning was done by links to the DBpedia ontology alone. In conclusion, reasoning was mainly facilitated with the help of OpenCyc concepts linked in DBpedia pages found for the concept labels.

The creation of the classification ontology in combination with the definition of ontological restrictions for the set of predicates involved some manual effort. Even more work was needed to set up the concept type mappings between the classification ontology and sources such as OpenCyc, as well as for the definition of SPARQL queries on the DBpedia graph (for query based grounding). The presented method is appropriate especially for relation learning tasks where parts or all of this information can be reused, i.e. which aim at detecting similar predicates and similar concept types. If the approach is applied in various domains there will soon be a pool of available classification concepts (including the mapping information into external ontologies) and relations defined upon those concepts. This will enable the ontology engineer to increasingly rely on existing definitions. Another way to reduce human effort is to create the links between the classification meta ontology and external ontologies with ontology alignment strategies.

## **5.4 Scarlet**

Scarlet [152] provides a method to discover relations between concepts based solely on data from Semantic Web sources. Section 4.3.5 gives a more detailed theoretical description of Scarlet, Section 4.6.4 contains the implementation details. We applied Scarlet to all 626 relations (training and testing relations) existing in the evaluation database by calling it with the two concept labels participating in each relation. Scarlet returned a predicate suggestion for only 10 out of these 626 input items, and 8 out of the 10 relation labels were among the predefined labels used in our architecture. We configured Scarlet's *RelationFinder* class towards high recall by searching into multiple ontologies and considering inherited relations to a depth of six links. Table 5.20 gives an overview of relations labeled by Scarlet. For all other relations Scarlet yielded no label suggestions.

The first column in Table 5.20 states whether the suggestion returned by Scarlet is among the relation labels defined in our system. If Scarlet returned multiple labels, then we chose the one known to our system, if any. A case of conflict, i.e. where Scarlet suggested more than one label from our list, never occurred. For the relation between *scientist* and *person* Scarlet suggested a number of predicates, among which were *subClass*, but also the named relations *hasProfession*<sup>9</sup> and *spouse*<sup>10</sup>. The last column in the table refers to whether Scarlet's suggestion was correct. For results not among our predefined labels we do not assess correctness as those suggestions have no relevance for the present architecture. 4 out of 8 suggestions by Scarlet were correct as verified by manual assessment, which results in a precision of 0.5. Recall was obviously very low; we attribute this to the knowledge acquisition bottleneck described in Section 2.1.1, and expect that recall will improve considerably with the growth of the Semantic Web. The incorrect results for the relations between *oil* and *industry* and between *coal* and *industry* are attributed to inaccurate relation labels in the underlying ontology, which used *oil* (and *coal*) in the sense of *oil (coal) industry*. This problem, and other reasons for inaccurate labels returned by Scarlet, are described by d'Aquin et al. [47].

*Table 5.20:* Results of calls to the Scarlet API with all relations
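The figures above translate into the following ratios; *coverage* is a label introduced here only for the share of input relations that received any suggestion, and it is not a measure used in the evaluation itself:

$$
\text{precision} = \frac{4}{8} = 0.5, \qquad
\text{coverage} = \frac{10}{626} \approx 1.6\%.
$$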

The original goal was to integrate Scarlet into the relation detection framework, but due to the low recall that Scarlet currently provides for the examined domain relations, Scarlet was not considered in the final evaluations provided in this thesis. Experiments show that Scarlet has no significant impact on the evaluation measures at the current state. Future work regarding the integration of Scarlet needs to address the non-trivial task of mapping named relations returned by Scarlet onto predicates used by our architecture.

<sup>9</sup> http://www.ontotext.com/kim/2004/04kimo#hasProfession

<sup>10</sup> http://paoli.open.ac.uk/watson-cache/f/48d/[...]b58b52e768#spouse

## **5.5 Evaluation of Integrated Data Sources**

This section presents the evaluation results for the enhanced VSM, which re-ranks similarity results according to the conformance of the concepts' type information with ontological restrictions from the classification meta ontology. Section 4.5.2 gives a formal description of the method. The evaluation data tables have a structure similar to those for the plain VSM in previous sections. The label *SIV* (Semantic Inference and Validation) marks the results which are based on the re-ranked VSM method. The raw VSM data, which is identical with the experiments given in Section 5.2, provides a baseline score in this section, and is denoted as *"Baseline: VSM"*. The verb aggregation modes (i), (ii) and (iii) remain the same, the verb extraction modes are reduced to *sentence* and *sliding window 7*. The measures in parentheses represent the results of a second set of evaluations restricted to relations where grounding was successful for at least one of the two concepts, i.e. grounding detected a type other than *cl:Unknown*. The additional filter reduces the number of relations to 437 (from the original 461); the purpose of the new measure is to give accuracy values only for relations where some sort of semantic validation could be applied.
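The re-ranking idea can be sketched as follows, using the domain and range that Section 5.5.4 reports for *study*. Everything else is an illustrative assumption: the function names, the rule that an unknown concept type never causes a conflict, and the simple "move violating labels to the end" scheme. The formal method is the one given in Section 4.5.2.

```python
# Domain and range restrictions per predicate. Only the entry for "study"
# is taken from the text; other predicates are left unrestricted here.
RESTRICTIONS = {
    "study": ({"Person", "Organization"}, {"ObjectTopic", "AbstractTopic"}),
}

def conforms(predicate, subject_type, object_type):
    """True unless a known concept type contradicts the predicate's restriction."""
    if predicate not in RESTRICTIONS:
        return True
    domain, range_ = RESTRICTIONS[predicate]
    subject_ok = subject_type == "Unknown" or subject_type in domain
    object_ok = object_type == "Unknown" or object_type in range_
    return subject_ok and object_ok

def rerank(ranked_labels, subject_type, object_type):
    """Move labels violating a restriction to the end, keeping VSM order otherwise."""
    # sorted() is stable, so conforming labels keep their VSM ranking.
    return sorted(ranked_labels,
                  key=lambda p: not conforms(p, subject_type, object_type))

# Hypothetical VSM ranking for one testing relation.
vsm_ranking = ["study", "use", "effectOn"]
print(rerank(vsm_ranking, "ObjectTopic", "Person"))  # ['use', 'effectOn', 'study']
print(rerank(vsm_ranking, "Person", "ObjectTopic"))  # ['study', 'use', 'effectOn']
```

The first call shows a relation whose concepts appear in interchanged order: the type conflict pushes *study* to the end of the list, which is the effect that makes the approach sensitive to relation direction.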

Similar to the evaluations in Section 5.2 this section starts with the presentation of results for the various performance measures: Average Ranking Precision (Section 5.5.1), first guess correct (Section 5.5.2) and second guess correct (Section 5.5.3). Section 5.5.4 provides some interesting findings regarding the performance of individual predicates. Finally, Section 5.5.5 draws conclusions upon the summarized results, provides significance information for the SIV method, and also gives a brief overview of evaluation results of other relation labeling methods found in literature.

#### **5.5.1 Average Ranking Precision**

Table 5.21 includes the results for the *SIV* measure for a configuration with *tf-idf 150* verb vectors. The benefit gained from semantic inference and validation is apparent compared to the VSM baseline. The system now achieves *non-directed* (*"direction: no"*) ARP scores of around 1.5, and scores of circa 2.0 for relations including direction. Relations with at least one concept grounded (the values in parentheses) obtain an additional increase in performance.

The use of *prepositions* in verb vectors, as presented in Table 5.22, provides another slight improvement in terms of ARP, pushing the scores clearly below the 1.5 value when ignoring direction, and below 2.0 for directed relation label suggestions. For *non-directed* the SIV measure, in the configuration with a *sentence* verb extraction mode and aggregation mode (i), yields the best performance, whereas for *directed* sliding windows and (iii) are most suitable. The random baselines from Table 5.21 apply here as well.

*Table 5.21:* ARP evaluation results for the VSM combined with semantic inference and validation (SIV)

#### **5.5.2 First Guess Correct**

Table 5.23 provides the first guess correct evaluation results for the *tf-idf 150* configuration, with best scores being about 74% correct first guesses for setting *non-directed*, and about 50% in directed mode (*"direction: yes"*). Similar to the observations made in the previous VSM experiments, and also for SIV evaluations of the ARP measure, *sentence* verb extraction leads to better *non-directed* results, and sliding windows are superior for directed relations. The improvements introduced by SIV are stronger for directed relations, as directions are an integral part of domain and range restrictions.

#### **5.5.3 Second Guess Correct**

When considering the constellation where the first or second guess of a relation label needs to be correct, the accuracy for relations that include direction is around 78%, which means that in most cases a domain expert relying on the method can very quickly assign the correct relation label from two alternatives. The random baseline scores (omitted from Table 5.24 for brevity) are 40% in *non-directed* detection, and 20% for *directed*.


*Table 5.22:* ARP results for the SIV method, including *prepositions* using a *tf-idf 150* configuration


*Table 5.23:* FGC performance comparing SIV with VSM and the two baseline scores


*Table 5.24:* SGC performance comparing SIV method to the plain VSM

#### **5.5.4 Individual Predicates**

Tables 5.25 and 5.26 contain the **ARP** and FGC results for evaluating the performance of all predicates individually. The predicates are referenced by their database IDs in the data tables; Section 5.2.4 and subsequent explanations provide information about the mapping to the corresponding labels.

The observations made in the VSM evaluation section (Section 5.2.4), especially that there are sometimes substantial differences in performance between active and passive voice of a single predicate, are still valid when integrating semantic inference and validation; compare for example the results for *use/usedBy* or *effectOn/affectedBy*. Those gaps between active and passive voice performance are generally more pronounced when using sliding windows.

Predicates such as *study*, which include clearly defined domain and range restrictions, perform particularly well. The *study* predicate is defined with a subject domain of (*Person, Organization*) and an object range of (*ObjectTopic, AbstractTopic*) - *study* yields first guess correct values of up to 90%, even for the *directed* setting. Interestingly there is little difference between *directed* and *non-directed* for *study*, presumably because domain and range are clearly defined, so a relation with concepts in interchanged order will be in conflict with those domain and range restrictions, and be re-ranked towards the end of the list. On the other hand, for *subClassOf/superClassOf*, where there are no clear domain and range restrictions (but rather property restrictions) defined, there is in many cases almost a doubling of numbers from *directed* to *non-directed*. Generally results for *subClassOf/superClassOf* are characterized by a big variety in accuracy depending on the settings, but on average the predicate pair yields a lower performance than other predicates. In comparison to the rather moderate accuracy levels provided by the predicates *takeActionBy/actionTakenBy* in the VSM-only evaluations in Section 5.2.4, SIV results show major improvements, especially for *directed* configurations. *takeActionBy/actionTakenBy* has tight domain and range restrictions similar to *study/studiedBy*.

*Table 5.25:* ARP scores for individual predicates for SIV method with a *tf-idf 150* configuration and verb aggregation mode (iii)

*Table 5.26:* FGC scores for individual predicates, verb aggregation mode (iii)

The **ARP** and FGC scores given in parentheses provide surprising results to some extent. Due to the smaller number of relations, the removal of a few relations may have a strong impact on the evaluated measures - as evident from the data presented. In some cases the results for the *one concept grounded* evaluations are even worse than the results for all relations; evidently the relations removed were amongst the ones labeled correctly. On the other hand the accuracy of some predicates such as *subClassOf* sharply increases with at least one concept grounded in most configurations, e.g. the FGC measure rises from 55% to 68.6% for the *directed* and *sliding window 7* configuration.

#### **5.5.5 Summary and Interpretation**

Concluding the description of experiments conducted, this section summarizes the evaluation results for the SIV approach. Table 5.21 compares the ARP measure for SIV and VSM-only and shows the clear benefits of SIV. Table 5.22 demonstrates that the inclusion of prepositions provides another slight improvement to ARP scores - also for the SIV method. The first guess correct results for SIV are around 74% for the setting *non-directed* and about 50% in directed mode, see Table 5.23. When considering first or second guess correct, accuracy goes up to around 78% for directed relations, so a domain expert can choose the correct label very quickly in most cases (Table 5.24). Finally, Tables 5.25 and 5.26 give evaluations for individual predicates, showing strong differences in performance amongst the predicates. Relations with tight ontological constraints perform well (e.g. *study* with ca. 90% correct on first guess for directed relations), whereas for the *subClassOf/superClassOf* relation pair the method is less successful.

Amongst the general conclusions from the evaluation of the SIV based approach are:

• The integration of semantic inference and validation into the original VSM model provides strong benefits as compared to the VSM-only approach, especially for relations that consider relation direction.



*Table 5.27:* First guess correct results, with verbs including *prepositions* 

It is obvious from the significance values presented in Section 5.2.4 that the results for the VSM integrated with concept type information are highly significant compared to the KS and random baselines, as even the results for the raw VSM yielded significance levels above 99.99%. Therefore, the most interesting aspect to determine is whether the integration of structured information and ontological reasoning provides a significant benefit over the plain VSM-based method. A χ² test comparing, for example, the first guess correct scores for the settings *directed/(i)/sl. window* 7 is significant at the 0.01 level; the same holds for the *non-directed* equivalent, which shows that SIV indeed provides statistically significant improvements.
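As a rough illustration of such a significance check, the following sketch computes a Pearson χ² statistic for a 2x2 table of correct versus incorrect first guesses for two methods. The text does not give the underlying contingency table or the exact test setup, so the table layout is an assumption of this sketch, the function name `chi_square_2x2` is hypothetical, and the counts in the usage lines are toy numbers, not results from the evaluation.

```python
def chi_square_2x2(correct_a, wrong_a, correct_b, wrong_b):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

    Rows are the two methods being compared, columns are the numbers of
    correct and incorrect first guesses (assumed layout).
    """
    observed = [[correct_a, wrong_a], [correct_b, wrong_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the hypothesis that both methods perform alike
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.01
CRITICAL_VALUE_DF1_ALPHA_001 = 6.635

# Toy counts for illustration only (not taken from the thesis)
statistic = chi_square_2x2(30, 70, 50, 50)
significant = statistic > CRITICAL_VALUE_DF1_ALPHA_001
```

A statistic above the critical value indicates a difference that is significant at the 0.01 level under the assumptions of the test.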

The SIV method is especially valuable when detecting *directed* relation types; for *non-directed* relation types it can even have a negative effect in some constellations, as domain and range restrictions enforce the correct order of concepts: training relations with the correct basic relation type but the wrong direction will be filtered or penalized. The evaluation results reflect this diagnosis, as SIV yields higher percentage gains for the *directed* setting.

*Figure 5.1:* Comparison of VSM and SIV method and baselines, FGC results for sliding window size 7

Figure 5.1 provides a graphical overview of the results of the implemented methods, based on data extracted from Table 5.23. It shows the FGC results for configuration *sl. window* 7 for the two baseline scores as well as for the VSM and SIV methods.

**F1 Score.** The *F1 score* (also known as F-score or F-measure), a measure very common in information retrieval, considers both precision and recall, and is computed as follows:

$$F1 = 2 \cdot \frac{(precision \cdot recall)}{precision + recall} \tag{5.1}$$

The accuracy of 74.84% (Table 5.27) correct label suggestions on the first guess (87.45% for the second guess, Table 5.24) corresponds to an F1 score of 0.86 (0.93) in a *non-directed* setting. For a *directed* setting, with 10 relation labels to choose from, the maximal F1 scores are 0.69 for first guess correct, and 0.88 for second guess correct. For relations where at least one concept could be grounded, i.e. the concept has a type other than *cl:Unknown*, the respective F1 values in the *directed* setting are slightly higher with 0.70 for first guesses, and 0.89 for second guess correct.
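Equation 5.1 can be checked with a few lines of code. The text does not state which precision and recall values were used to obtain the reported F1 scores. The sketch below therefore makes an assumption of its own: one of the two values is set to 1.0 and the other to the reported accuracy, a reading chosen only because it reproduces the *non-directed* figures quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 score as in Equation 5.1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumption of this sketch (not stated in the text): one value is 1.0,
# the other equals the reported accuracy.
f1_first_guess = round(f1_score(1.0, 0.7484), 2)   # first guess correct, non-directed
f1_second_guess = round(f1_score(1.0, 0.8745), 2)  # second guess correct, non-directed
print(f1_first_guess, f1_second_guess)  # prints: 0.86 0.93
```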

**Results Reported in the Literature.** Table 5.28 gives an overview of the *a posteriori* accuracy of relation detection methods in literature. It is important to note that those methods and hence the results cannot be directly compared to the approach presented here, as they involve completely different corpora, evaluation methodologies and settings.


*Table 5.28:* Approaches to relation detection [190], the accuracy in the case of Rinaldi et al. varies by corpus and relation type

Ontology learning methods, and especially relation detection approaches, are hard to compare for various reasons. First, there is not much consensus in the ontology learning community on the concrete tasks of ontology learning [37]. Second, evaluation results differ substantially depending on whether an *a posteriori* or an *a priori* evaluation was used. *A priori* evaluations are based upon a gold standard built independently of the system to be evaluated, so the system is evaluated against this gold standard in a strict way [37]. The advantages of *a priori* evaluations are that they can be done automatically, and that they are independent of human assessment. The major drawback originates from the fact that a real-world domain can be modeled in many different ways, so reasonable results from the evaluated system do not necessarily correspond to the gold standard. In *a posteriori* evaluations, by contrast, the evaluator (e.g. a domain expert) manually assesses the results of the system. The drawbacks are the need for manual effort and that the evaluation depends on how inclined the evaluator is to regard the suggestions of the system as correct. Results, for example regarding the average precision of a method, can be 10% higher if evaluated *a posteriori* [164]. The present work uses an *a priori* evaluation in the sense that we determine the correct relation labels before applying the method.

#### **Assumption of Correct Grounding**

To show some of the potential of the SIV method, this section presents evaluation results for semantic inference and validation under the assumption that all concepts were grounded correctly. For this scenario we manually set the concept types to the correct values according to the classification meta ontology to simulate the improvement capabilities of enhanced grounding techniques and extended external datasets. Table 5.29 gives first guess correct results for the configuration *tf-idf 150* and verbs that include prepositions (setting *prepositions*).


*Table 5.29:* First guess correct results under the assumption that all concepts are correctly grounded

Correct grounding of all concepts would yield an additional improvement of 4-8 percentage points on the first guess correct scores, as shown in Table 5.29. The numbers imply that improving the current grounding procedure is certainly helpful in order to sharpen the method. However, the results from the corpus-based methods (vector space model), as well as the strategies to leverage online structured information, also need to be enhanced in future work to raise the accuracy of the approach.

After presenting the results of extensive evaluation procedures in the current chapter, Chapter 6 will conclude the thesis with a summary of the main aspects of the presented work, recapitulate the observations and learnings from the experiments, and suggest future work to enhance the presented methods.

# **Chapter 6 Conclusions and Outlook**

After laying the theoretical foundations in the first three chapters, and then presenting and evaluating the relation label suggestion methods in Chapters 4 and 5, this chapter concludes the thesis by summarizing the approach and the experiments conducted. It highlights the main contributions, and outlines the most promising areas for future research.

**Summary.** In ontology learning, the task of labeling non-taxonomic relations in domain ontologies is among the most difficult and least tackled problems [95]. The presented approach introduces a set of methods to address this issue. This approach combines corpus-based methods, which have domain text as their only source of input, with a technique to validate ontological restrictions relying on knowledge inferred from Semantic Web information sources. The corpus-based methods utilize verbs co-occurring with the respective relations in vector space models to calculate the similarity to known relations. Based on the similarity values, the algorithms refine the relation labeling results by validating the conformance of the entities involved against ontological restrictions defined with the help of a meta ontology. The crucial ingredient in this process is *concept grounding*, i.e. the task of linking the concepts from the domain ontology into the meta ontology in a procedure that includes reasoning techniques with external data sources, such as DBpedia and OpenCyc.
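The two stages summarized here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation evaluated in Chapter 5: it uses raw verb counts instead of the weighting schemes discussed there, it filters candidates by exact type match instead of reasoning over a meta ontology, and the function names, the verbs and the concept types `Agent` and `Topic` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse verb-count vectors."""
    dot = sum(a[verb] * b[verb] for verb in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_labels(verbs, labeled, restrictions, subject_type=None, object_type=None):
    """Rank candidate labels by similarity, dropping labels whose assumed
    domain/range restriction conflicts with a known concept type."""
    scored = []
    for label, label_verbs in labeled.items():
        domain, range_ = restrictions.get(label, (None, None))
        # An ungrounded concept (type None) never violates a restriction
        if domain and subject_type and subject_type != domain:
            continue
        if range_ and object_type and object_type != range_:
            continue
        scored.append((cosine(verbs, label_verbs), label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]


# Toy data: verb vectors of already labeled relations (hypothetical values)
labeled = {
    "study": Counter({"investigate": 3, "examine": 2}),
    "subClassOf": Counter({"include": 4, "comprise": 1}),
}
restrictions = {"study": ("Agent", "Topic")}  # hypothetical domain and range
unlabeled = Counter({"investigate": 2, "include": 1})

by_similarity = suggest_labels(unlabeled, labeled, restrictions, "Agent", "Topic")
after_validation = suggest_labels(unlabeled, labeled, restrictions, "Topic", "Topic")
```

In the toy data the similarity ranking puts *study* first, while a subject concept grounded to a type outside the assumed domain of *study* removes that label from the suggestions.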

An extensive set of experiments helped to assess the performance of the presented approach. Training and testing relations were labeled with one of five basic predicates. When distinguishing the correct predicate and the direction of the relation, this resulted in ten relation label candidates. The evaluation metrics of Average Ranking Precision, first guess correct and second guess correct were applied to evaluate different configurations of the relation labeling method. The method yields an accuracy of 53% correct suggestions on first guesses regarding relation type and direction. When ignoring direction the accuracy increases to 75%. The average position of the correct label in the list of label suggestions is about 2.0 with ten candidates, and slightly below 1.5 when neglecting relation direction.
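The rank-based figures quoted above can be computed from ranked suggestion lists as in the following sketch. It assumes that every ranking contains the correct label and that "second guess correct" means the correct label is in first or second position. Average Ranking Precision is left out because its definition is not restated in this chapter. The function name and the toy rankings are hypothetical.

```python
def evaluate_rankings(rankings, gold_labels):
    """Compute first/second guess correct rates and the mean position of the
    correct label. Each ranking is a list of labels, best suggestion first."""
    positions = [
        ranked.index(gold) + 1  # 1-based position of the correct label
        for ranked, gold in zip(rankings, gold_labels)
    ]
    n = len(positions)
    return {
        "first_guess_correct": sum(p == 1 for p in positions) / n,
        "second_guess_correct": sum(p <= 2 for p in positions) / n,
        "mean_position": sum(positions) / n,
    }


# Toy rankings over three candidate labels (illustration only)
rankings = [
    ["study", "subClassOf", "studiedBy"],
    ["subClassOf", "study", "studiedBy"],
    ["studiedBy", "subClassOf", "study"],
]
gold_labels = ["study", "study", "study"]
scores = evaluate_rankings(rankings, gold_labels)
```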

The evaluation results fluctuate depending on the configuration used by the architecture. Some of the settings had no consistent positive or negative effect on performance, for example the inclusion of the 150 most significant verbs per relation in the verb vectors versus the 20 most significant, or confirming verbs with WordNet - the outcome depends on the remaining evaluation metrics and configurations chosen (as outlined in Chapter 5). Other settings, like the optional use of prepositions occurring directly after verbs in text, consistently yielded positive effects. The evaluations also demonstrated that a large quantity of training relations, or a high number of sentences from the corpus where individual relations match, positively impact the evaluated metrics.

The experiments revealed substantial differences in performance between individual predicates. Predicates that caused few disagreements between domain experts when manually labeling training relations perform better. The same observation holds for predicates which include clearly specified and tight domain, range and property restrictions. For some predicates (e.g. *study*), the presented algorithms reached first guess correct results of around 90% when choosing relation labels from ten alternatives.

A comparison of evaluation results between the methods presented in the thesis and two baseline scores illustrates highly significant gains in performance. For this purpose a random baseline and a baseline adopted from the literature were used.

The experiments also demonstrated the significant benefits achieved by the integration of knowledge inferred from external structured sources in the relation labeling process, as compared to relying on corpus-based methods only. However, current online datasets and ontologies involve certain data quality issues outlined in Section 5.3 (e.g. DBpedia redirects such as *activist* → *activism*, resulting in wrong concept grounding). Nevertheless, the advantages of incorporating external sources will increase over the next years, as more and more linked data is made available online.

**Main Contributions.** In summary, the thesis *contributes* to compiling a common body of knowledge and advancing the state of the art by:

• Introducing a novel approach for the ontology learning task of labeling non-taxonomic relations. The thesis demonstrates the accuracy of the approach to learn specific relations, and compares it to state-of-the-art techniques (although the various methods are not directly comparable due to different evaluation methodologies, numbers of relations to learn, underlying datasets, etc.).


**Future Research.** Despite these advances, there are a number of open issues that will require further attention. The following paragraphs outline major lines of *future work* to tackle some of these issues.

On the one hand, the current implementation assumes exactly one relation label to be appropriate for the relation between two concepts. Future research will investigate the implications of allowing multiple labels per relation. On the other hand, there are cases where none of the predefined labels is suitable; thresholds on similarity values will detect such situations. The author plans to determine the performance impacts and other consequences of raising or reducing the number of predefined predicates, as well as to apply the relation labeling architecture in other domains.

Several ideas have come up in the course of the present work on how to improve the performance of the corpus-based methods:


An important line of development focuses on the improvement of concept grounding. The integration of additional sources, e.g. the Wikipedia category system which is represented by the SKOS vocabulary and the YAGO [180] classification schema, will help to raise the methods' recall. If grounding still fails, methods for the acquisition of synonyms or term resolution procedures (such as [196]) should be integrated. Disambiguation techniques to find the appropriate meaning of a term in cases where DBpedia returns disambiguation pages for input concept labels will address a similar problem. With the availability of additional structured data, advanced conflict resolution and mediation techniques will become an essential component of the refined grounding strategies.

For the practical application of the proposed methods it is crucial to reduce human effort involved in creating training relations and ontological definitions. Future work will comprise bootstrapping techniques (e.g. [52]) to support the automatic creation of training relations for particular relation types (predicates). Instead of defining domain and range restrictions manually, either existing specifications should be re-used, or mechanisms applied to learn the restrictions from existing training relations. After grounding concepts from training relations, the system can detect the appropriate restrictions automatically. It is presumably more effective to use some probabilistic model to specify and to validate ontological restrictions if they are learned automatically, because concept type information gained from concept grounding includes a certain amount of misclassification. Next to the construction of training relations and the definition of ontological restrictions in the classification meta ontology, the specification of links between concepts in external ontologies (for example OpenCyc) and concepts from the meta ontology still requires significant human effort. Ontology mapping techniques that exploit lexical similarity could be used to automatically propose such links.
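One simple way to picture learning restrictions from grounded training relations is a relative-frequency estimate of which concept type pairs occur with each predicate. The sketch below is only an illustration of that idea: the thesis leaves the choice of probabilistic model open, and the function name as well as the concept types `Agent` and `Topic` are hypothetical.

```python
from collections import Counter, defaultdict


def learn_restrictions(training_relations):
    """Estimate P(subject_type, object_type | predicate) by relative frequency.

    training_relations: iterable of (predicate, subject_type, object_type)
    triples obtained after grounding the concepts of the training relations.
    """
    counts = defaultdict(Counter)
    for predicate, subject_type, object_type in training_relations:
        counts[predicate][(subject_type, object_type)] += 1
    return {
        predicate: {pair: n / sum(pairs.values()) for pair, n in pairs.items()}
        for predicate, pairs in counts.items()
    }


# Toy grounded training relations (hypothetical types, illustration only)
training = [
    ("study", "Agent", "Topic"),
    ("study", "Agent", "Topic"),
    ("study", "Topic", "Topic"),  # could stem from a grounding error
    ("subClassOf", "Topic", "Topic"),
]
model = learn_restrictions(training)
```

Soft estimates of this kind would let an occasional misgrounded concept lower a label's score instead of ruling the label out entirely.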

## **Bibliography**


sion of rote extractors. In *Proceedings of the 2nd Workshop on Ontology Learning and Population: Bridging the Gap between Text and Knowledge,* pages 49-56, Sydney, Australia, July 2006. Association for Computational Linguistics.


biology ontology. In P. Buitelaar and P. Cimiano, editors, *Ontology Learning and Population: Bridging the Gap between Text and Knowledge,* volume 167 of *Frontiers in Artificial Intelligence and Applications,*  pages 91-103. IOS Press, Amsterdam, Netherlands, 2008.


inductive logic programming. *Journal of Machine Learning Research,*  4:493-525, 2003.


Wikis. 3rd European Semantic Web Conference (ESWC2006), June 2006.


ontologies. Technical report, University of Karlsruhe, Institute AIFB, 2003.


*Ontology Learning,* volume 31 of *CEUR Workshop Proceedings.* CEUR-WS.org, 2000.


*Asia Conference on Knowledge Discovery and Data Mining,* pages 277-288, Bangkok, Thailand, 2009.


#### **Forschungsergebnisse der Wirtschaftsuniversität Wien**

Editor: Wirtschaftsuniversität Wien, represented by a.o. Univ. Prof. Dr. Barbara Sporn
